# AI Course Assessment Generator - Functionality Report
## Table of Contents
1. [Overview](#overview)
2. [System Architecture](#system-architecture)
3. [Data Models](#data-models)
4. [Application Entry Point](#application-entry-point)
5. [User Interface Structure](#user-interface-structure)
6. [Complete Workflow](#complete-workflow)
7. [Detailed Component Functionality](#detailed-component-functionality)
8. [Quality Standards and Prompts](#quality-standards-and-prompts)
---
## Overview
The AI Course Assessment Generator is a sophisticated educational tool that automates the creation of learning objectives and multiple-choice questions from course materials. It leverages OpenAI's language models with structured output generation to produce high-quality educational assessments that adhere to specified quality standards and Bloom's Taxonomy levels.
### Key Capabilities
- **Multi-format Content Processing**: Accepts `.vtt`, `.srt` (subtitle files), and `.ipynb` (Jupyter notebooks)
- **AI-Powered Generation**: Uses OpenAI's GPT models with configurable parameters
- **Quality Assurance**: Implements LLM-based quality assessment and ranking
- **Source Tracking**: Maintains XML-tagged references from source materials to generated content
- **Iterative Improvement**: Supports feedback-based regeneration and enhancement
- **Parallel Processing**: Generates questions concurrently for improved performance
---
## System Architecture
### Architectural Patterns
#### 1. **Orchestrator Pattern**
Both `LearningObjectiveGenerator` and `QuizGenerator` act as orchestrators that coordinate calls to specialized generation functions rather than implementing generation logic directly.
#### 2. **Modular Prompt System**
The `prompts/` directory contains reusable prompt components that are imported and combined in generation modules, allowing for consistent quality standards across different generation tasks.
#### 3. **Structured Output Generation**
All LLM interactions use Pydantic models with the `instructor` library to ensure consistent, validated output formats using OpenAI's structured output API.
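As a rough illustration of this pattern (not the project's exact code), a structured-output call using the OpenAI SDK's parse helper and a Pydantic response model might look like the following; the model name, prompt, and class names are placeholders:

```python
from typing import List
from openai import OpenAI
from pydantic import BaseModel

# Placeholder response schema; the real modules define richer models (see Data Models below)
class ExampleObjective(BaseModel):
    id: int
    learning_objective: str

class ExampleResponse(BaseModel):
    objectives: List[ExampleObjective]

client = OpenAI()  # reads OPENAI_API_KEY from the environment

completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",  # placeholder model choice
    messages=[{"role": "user", "content": "Write two learning objectives about Python lists."}],
    response_format=ExampleResponse,  # the SDK validates the reply against this schema
)
parsed: ExampleResponse = completion.choices[0].message.parsed
print([o.learning_objective for o in parsed.objectives])
```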
#### 4. **Source Tracking via XML Tags**
Content is wrapped in XML tags (e.g., `<source file="example.ipynb">content</source>`) throughout the pipeline to maintain traceability from source files to generated questions.
### Technology Stack
- **Python 3.8+**
- **Gradio 5.29.0+**: Web-based UI framework
- **Pydantic 2.8.0+**: Data validation and schema management
- **OpenAI 1.52.0+**: LLM API integration
- **Instructor 1.7.9+**: Structured output generation
- **nbformat 5.9.2**: Jupyter notebook parsing
- **python-dotenv 1.0.0**: Environment variable management
---
## Data Models
### Learning Objectives Progression
The system uses a hierarchical progression of learning objective models:
#### 1. **BaseLearningObjectiveWithoutCorrectAnswer**
```python
- id: int
- learning_objective: str
- source_reference: Union[List[str], str]
```
Initial generation without correct answers.
#### 2. **BaseLearningObjective**
```python
- id: int
- learning_objective: str
- source_reference: Union[List[str], str]
- correct_answer: str
```
Base objectives with correct answers added.
#### 3. **LearningObjective**
```python
- id: int
- learning_objective: str
- source_reference: Union[List[str], str]
- correct_answer: str
- incorrect_answer_options: Union[List[str], str]
- in_group: Optional[bool]
- group_members: Optional[List[int]]
- best_in_group: Optional[bool]
```
Enhanced with incorrect answer suggestions and grouping metadata.
#### 4. **GroupedLearningObjective**
```python
(All fields from LearningObjective)
- in_group: bool (required)
- group_members: List[int] (required)
- best_in_group: bool (required)
```
Fully grouped and ranked objectives.
### Question Models Progression
#### 1. **MultipleChoiceOption**
```python
- option_text: str
- is_correct: bool
- feedback: str
```
#### 2. **MultipleChoiceQuestion**
```python
- id: int
- question_text: str
- options: List[MultipleChoiceOption]
- learning_objective_id: int
- learning_objective: str
- correct_answer: str
- source_reference: Union[List[str], str]
- judge_feedback: Optional[str]
- approved: Optional[bool]
```
#### 3. **RankedMultipleChoiceQuestion**
```python
(All fields from MultipleChoiceQuestion)
- rank: int
- ranking_reasoning: str
- in_group: bool
- group_members: List[int]
- best_in_group: bool
```
#### 4. **Assessment**
```python
- learning_objectives: List[LearningObjective]
- questions: List[RankedMultipleChoiceQuestion]
```
Final output containing both objectives and questions.
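For reference, a hedged sketch of how the core question models might be declared with Pydantic; the field names mirror the listings above, while the `Optional` defaults are assumptions:

```python
from typing import List, Optional, Union
from pydantic import BaseModel

class MultipleChoiceOption(BaseModel):
    option_text: str
    is_correct: bool
    feedback: str

class MultipleChoiceQuestion(BaseModel):
    id: int
    question_text: str
    options: List[MultipleChoiceOption]
    learning_objective_id: int
    learning_objective: str
    correct_answer: str
    source_reference: Union[List[str], str]
    judge_feedback: Optional[str] = None  # assumed default
    approved: Optional[bool] = None       # assumed default
```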
### Configuration Models
#### **MODELS**
Available OpenAI models: `["o3-mini", "o1", "gpt-4.1", "gpt-4o", "gpt-4o-mini", "gpt-4", "gpt-3.5-turbo", "gpt-5", "gpt-5-mini", "gpt-5-nano"]`
#### **TEMPERATURE_UNAVAILABLE**
Dictionary mapping models to temperature availability (some models like o1, o3-mini, and gpt-5 variants don't support temperature settings).
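A plausible shape for this configuration (the exact contents of `models/config.py` may differ) is sketched below; the helper mirrors the `TEMPERATURE_UNAVAILABLE.get(model, True)` guard shown later in the question-generation code:

```python
# True means the model's API rejects a temperature parameter
TEMPERATURE_UNAVAILABLE = {
    "o1": True,
    "o3-mini": True,
    "gpt-5": True,
    "gpt-5-mini": True,
    "gpt-5-nano": True,
    "gpt-4.1": False,
    "gpt-4o": False,
    "gpt-4o-mini": False,
    "gpt-4": False,
    "gpt-3.5-turbo": False,
}

def supports_temperature(model: str) -> bool:
    # Unknown models are conservatively treated as not supporting temperature
    return not TEMPERATURE_UNAVAILABLE.get(model, True)
```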
---
## Application Entry Point
### `app.py`
The root-level entry point that:
1. Loads environment variables from `.env` file
2. Checks for `OPENAI_API_KEY` presence
3. Creates the Gradio UI via `ui.app.create_ui()`
4. Launches the web interface at `http://127.0.0.1:7860`
```python
# Workflow:
load_dotenv() → Check API key → create_ui() → app.launch()
```
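A minimal sketch of what `app.py` plausibly contains, assuming `create_ui()` returns a Gradio `Blocks` object (error handling and launch arguments are illustrative):

```python
import os
from dotenv import load_dotenv
from ui.app import create_ui

# Load OPENAI_API_KEY (and any other settings) from the project's .env file
load_dotenv()

if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file.")

# Build the Gradio interface and serve it locally
app = create_ui()
app.launch(server_name="127.0.0.1", server_port=7860)
```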
---
## User Interface Structure
### `ui/app.py` - Gradio Interface
The UI is organized into **3 main tabs**:
#### **Tab 1: Generate Learning Objectives**
**Input Components:**
- File uploader (accepts `.ipynb`, `.vtt`, `.srt`)
- Number of objectives per run (slider: 1-20, default: 3)
- Number of generation runs (dropdown: 1-5, default: 3)
- Model selection (dropdown, default: "gpt-5")
- Incorrect answer model selection (dropdown, default: "gpt-5")
- Temperature setting (dropdown: 0.0-1.0, default: 1.0)
- Generate button
- Feedback input textbox
- Regenerate button
**Output Components:**
- Status textbox
- Best-in-Group Learning Objectives (JSON)
- All Grouped Learning Objectives (JSON)
- Raw Ungrouped Learning Objectives (JSON) - for debugging
**Event Handler:** `process_files()` from `objective_handlers.py`
#### **Tab 2: Generate Questions**
**Input Components:**
- Learning Objectives JSON (auto-populated from Tab 1)
- Model selection
- Temperature setting
- Number of question generation runs (slider: 1-5, default: 1)
- Generate Questions button
**Output Components:**
- Status textbox
- Ranked Best-in-Group Questions (JSON)
- All Grouped Questions (JSON)
- Formatted Quiz (human-readable format)
**Event Handler:** `generate_questions()` from `question_handlers.py`
#### **Tab 3: Propose/Edit Question**
**Input Components:**
- Question guidance/feedback textbox
- Model selection
- Temperature setting
- Generate Question button
**Output Components:**
- Status textbox
- Generated Question (JSON)
**Event Handler:** `propose_question_handler()` from `feedback_handlers.py`
---
## Complete Workflow
### Phase 1: File Upload and Content Processing
#### Step 1.1: File Upload
User uploads one or more files (`.vtt`, `.srt`, `.ipynb`) through the Gradio interface.
#### Step 1.2: File Path Extraction (`objective_handlers._extract_file_paths()`)
```python
# Handles different input formats:
- List of file paths
- Single file path string
- File objects with .name attribute
```
#### Step 1.3: Content Processing (`ui/content_processor.py`)
**For Subtitle Files (`.vtt`, `.srt`):**
```python
1. Read file with UTF-8 encoding
2. Split into lines
3. Filter out:
- Empty lines
- Numeric timestamp indicators
- Lines containing '-->' (timestamps)
- 'WEBVTT' header lines
4. Combine remaining text lines
5. Wrap in XML tags: <source file='filename.vtt'>content</source>
```
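A minimal sketch of this filtering logic (the function name and details are illustrative, not the project's exact code):

```python
from pathlib import Path

def process_subtitle_file(file_path: str) -> str:
    """Keep only spoken text from a .vtt/.srt file and wrap it in a source tag."""
    lines = Path(file_path).read_text(encoding="utf-8").splitlines()
    text_lines = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue                      # skip empty lines
        if stripped.isdigit():
            continue                      # skip numeric cue indices
        if "-->" in stripped:
            continue                      # skip timestamp lines
        if stripped.startswith("WEBVTT"):
            continue                      # skip the WEBVTT header
        text_lines.append(stripped)
    content = " ".join(text_lines)
    return f"<source file='{Path(file_path).name}'>{content}</source>"
```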
**For Jupyter Notebooks (`.ipynb`):**
```python
1. Validate JSON format
2. Parse with nbformat.read()
3. Extract from cells:
- Markdown cells: [Markdown]\n{content}
- Code cells: [Code]\n```python\n{content}\n```
4. Combine all cell content
5. Wrap in XML tags: <source file='filename.ipynb'>content</source>
```
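And a corresponding sketch for notebooks, assuming the nbformat-based extraction described above (illustrative only):

```python
import json
from pathlib import Path

import nbformat

FENCE = "`" * 3  # triple backticks, spelled out to keep this example readable

def process_notebook_file(file_path: str) -> str:
    """Extract markdown and code cells; fall back to the raw text on parse errors."""
    raw = Path(file_path).read_text(encoding="utf-8")
    try:
        json.loads(raw)                              # validate JSON structure first
        nb = nbformat.reads(raw, as_version=4)
        parts = []
        for cell in nb.cells:
            if cell.cell_type == "markdown":
                parts.append(f"[Markdown]\n{cell.source}")
            elif cell.cell_type == "code":
                parts.append(f"[Code]\n{FENCE}python\n{cell.source}\n{FENCE}")
        content = "\n\n".join(parts)
    except Exception:
        content = f"{FENCE}\n{raw}\n{FENCE}"         # fallback: wrap the raw content in a code block
    return f"<source file='{Path(file_path).name}'>{content}</source>"
```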
**Error Handling:**
- Invalid JSON: Wraps raw content in code blocks
- Parsing failures: Falls back to plain text extraction
- All errors logged to console
#### Step 1.4: State Storage
Processed content stored in global state (`ui/state.py`):
```python
processed_file_contents = [tagged_content_1, tagged_content_2, ...]
```
### Phase 2: Learning Objective Generation
#### Step 2.1: Multi-Run Base Generation
**Process:** `objective_handlers._generate_multiple_runs()`
For each run (user-specified, typically 3 runs):
1. **Call:** `QuizGenerator.generate_base_learning_objectives()`
2. **Workflow:**
```
generate_base_learning_objectives()
    ↓
generate_base_learning_objectives_without_correct_answers()
    → Creates prompt with:
        - BASE_LEARNING_OBJECTIVES_PROMPT
        - BLOOMS_TAXONOMY_LEVELS
        - LEARNING_OBJECTIVE_EXAMPLES_WITHOUT_ANSWERS
        - Combined file contents
    → Calls OpenAI API with structured output
    → Returns List[BaseLearningObjectiveWithoutCorrectAnswer]
    ↓
generate_correct_answers_for_objectives()
    → For each objective:
        - Creates prompt with objective and course content
        - Calls OpenAI API (unstructured text response)
        - Extracts correct answer
    → Returns List[BaseLearningObjective]
```
3. **ID Assignment:**
```python
# Temporary IDs by run:
Run 1: 1001, 1002, 1003
Run 2: 2001, 2002, 2003
Run 3: 3001, 3002, 3003
```
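Expressed as code, the per-run offset scheme might look like this (illustrative only; `all_runs` is assumed to be a list of per-run objective lists):

```python
for run_number, run_objectives in enumerate(all_runs, start=1):
    offset = run_number * 1000                 # run 1 -> 1000, run 2 -> 2000, ...
    for i, objective in enumerate(run_objectives, start=1):
        objective.id = offset + i              # e.g. 1001, 1002, 1003 for run 1
```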
4. **Aggregation:**
All objectives from all runs combined into single list.
**Example:** 3 runs × 3 objectives = 9 total base objectives
#### Step 2.2: Grouping and Ranking
**Process:** `objective_handlers._group_base_objectives_add_incorrect_answers()`
**Step 2.2.1: Group Base Objectives**
```python
QuizGenerator.group_base_learning_objectives()
    ↓
learning_objective_generator/grouping_and_ranking.py
    → group_base_learning_objectives()
```
**Grouping Logic:**
1. Creates prompt containing:
- Original generation criteria
- All base objectives with IDs
- Course content for context
- Grouping instructions
2. **Special Rule:** All objectives with IDs ending in 1 (1001, 2001, 3001) are grouped together and ONE is marked as best-in-group (this becomes the primary/first objective)
3. **LLM Call:**
- Model: `gpt-5-mini`
- Response format: `GroupedBaseLearningObjectivesResponse`
- Returns: Grouped objectives with metadata
4. **Output Structure:**
```python
{
"all_grouped": [all objectives with group metadata],
"best_in_group": [objectives marked as best in their groups]
}
```
**Step 2.2.2: ID Reassignment** (`_reassign_objective_ids()`)
```python
1. Find best objective from the 001 group
2. Assign it ID = 1
3. Assign remaining objectives IDs starting from 2
```
**Step 2.2.3: Generate Incorrect Answer Options**
Only for **best-in-group** objectives:
```python
QuizGenerator.generate_lo_incorrect_answer_options()
    ↓
learning_objective_generator/enhancement.py
    → generate_incorrect_answer_options()
```
**Process:**
1. For each best-in-group objective:
- Creates prompt with:
- Objective and correct answer
- INCORRECT_ANSWER_PROMPT guidelines
- INCORRECT_ANSWER_EXAMPLES
- Course content
- Calls OpenAI API (with optional model override)
- Generates 5 plausible incorrect answer options
2. **Returns:** `List[LearningObjective]` with incorrect_answer_options populated
**Step 2.2.4: Improve Incorrect Answers**
```python
learning_objective_generator.regenerate_incorrect_answers()
    ↓
learning_objective_generator/suggestion_improvement.py
```
**Quality Check Process:**
1. For each objective's incorrect answers:
- Checks for red flags (contradictory phrases, absolute terms)
- Examples of red flags:
- "but not necessarily"
- "at the expense of"
- "rather than"
- "always", "never", "exclusively"
2. If problems found:
- Logs issue to `incorrect_suggestion_debug/` directory
- Regenerates incorrect answers with additional constraints
- Updates objective with improved answers
**Step 2.2.5: Final Assembly**
Creates final list where:
- Best-in-group objectives have enhanced incorrect answers
- Non-best-in-group objectives have empty `incorrect_answer_options: []`
#### Step 2.3: Display Results
**Three output formats:**
1. **Best-in-Group Objectives** (primary output):
- Only objectives marked as best_in_group
- Includes incorrect answer options
- Sorted by ID
- Formatted as JSON
2. **All Grouped Objectives**:
- All objectives with grouping metadata
- Shows group_members arrays
- Best-in-group flags visible
3. **Raw Ungrouped** (debug):
- Original objectives from all runs
- No grouping metadata
- Original temporary IDs
#### Step 2.4: State Update
```python
set_learning_objectives(grouped_result["all_grouped"])
set_processed_contents(file_contents) # Already set, but persisted
```
### Phase 3: Question Generation
#### Step 3.1: Parse Learning Objectives
**Process:** `question_handlers._parse_learning_objectives()`
```python
1. Parse JSON from Tab 1 output
2. Create LearningObjective objects from dictionaries
3. Validate required fields
4. Return List[LearningObjective]
```
#### Step 3.2: Multi-Run Question Generation
**Process:** `question_handlers._generate_questions_multiple_runs()`
For each run (user-specified, typically 1 run):
```python
QuizGenerator.generate_questions_in_parallel()
    ↓
quiz_generator/assessment.py
    → generate_questions_in_parallel()
```
**Parallel Generation Process:**
1. **Thread Pool Setup:**
```python
max_workers = min(len(learning_objectives), 5)
ThreadPoolExecutor(max_workers=max_workers)
```
2. **For Each Learning Objective (in parallel):**
**Step 3.2.1: Question Generation** (`quiz_generator/question_generation.py`)
```python
generate_multiple_choice_question()
```
**a) Source Content Matching:**
```python
- Extract source_reference from objective
- Search file_contents for matching XML tags
- Exact match: <source file='filename.vtt'>
- Fallback: Partial filename match
- Last resort: Use all file contents combined
```
**b) Multi-Source Handling:**
```python
if len(source_references) > 1:
Add special instruction:
"Question should synthesize information across sources"
```
**c) Prompt Construction:**
```python
Combines:
- Learning objective
- Correct answer
- Incorrect answer options from objective
- GENERAL_QUALITY_STANDARDS
- MULTIPLE_CHOICE_STANDARDS
- EXAMPLE_QUESTIONS
- QUESTION_SPECIFIC_QUALITY_STANDARDS
- CORRECT_ANSWER_SPECIFIC_QUALITY_STANDARDS
- INCORRECT_ANSWER_EXAMPLES_WITH_EXPLANATION
- ANSWER_FEEDBACK_QUALITY_STANDARDS
- Matched course content
```
**d) API Call:**
```python
- Model: User-selected (default: gpt-5)
- Temperature: User-selected (if supported by model)
- Response format: MultipleChoiceQuestion
- Returns: Question with 4 options, each with feedback
```
**e) Post-Processing:**
```python
- Set question ID = learning_objective ID
- Verify all options have feedback
- Add default feedback if missing
```
**Step 3.2.2: Quality Assessment** (`quiz_generator/question_improvement.py`)
```python
judge_question_quality()
```
**Quality Judging Process:**
```python
1. Creates evaluation prompt with:
- Question text and all options
- Quality criteria from prompts
- Evaluation instructions
2. LLM evaluates question for:
- Clarity and unambiguity
- Alignment with learning objective
- Quality of incorrect options
- Feedback quality
- Appropriate difficulty
3. Returns:
- approved: bool
- feedback: str (reasoning for judgment)
4. Updates question:
question.approved = approved
question.judge_feedback = feedback
```
3. **Results Collection:**
```python
- Questions collected as futures complete
- IDs assigned sequentially across runs
- All questions aggregated into single list
```
**Example:** 3 objectives × 1 run = 3 questions generated in parallel
#### Step 3.3: Grouping Questions
**Process:** `quiz_generator/question_ranking.py → group_questions()`
```python
1. Creates prompt with:
- All generated questions
- Grouping instructions
- Example format
2. LLM identifies:
- Questions testing same concept (same learning_objective_id)
- Groups of similar questions
- Best question in each group
3. Model: gpt-5-mini
Response format: GroupedMultipleChoiceQuestionsResponse
4. Returns:
{
"grouped": [all questions with group metadata],
"best_in_group": [best questions from each group]
}
```
#### Step 3.4: Ranking Questions
**Process:** `quiz_generator/question_ranking.py → rank_questions()`
**Only ranks best-in-group questions:**
```python
1. Creates prompt with:
- RANK_QUESTIONS_PROMPT
- All quality standards
- Best-in-group questions only
- Course content for context
2. Ranking Criteria:
- Question clarity and unambiguity
- Alignment with learning objective
- Quality of incorrect options
- Feedback quality
- Appropriate difficulty (prefers simple English)
- Adherence to all guidelines
- Avoidance of absolute terms
3. Special Instructions:
- NEVER change question with ID=1
- Each question gets unique rank (2, 3, 4, ...)
- Rank 1 is reserved
- All questions must be returned
4. Model: User-selected
Response format: RankedMultipleChoiceQuestionsResponse
5. Returns:
{
"ranked": [questions with rank and ranking_reasoning]
}
```
#### Step 3.5: Format Results
**Process:** `question_handlers._format_question_results()`
**Three outputs:**
1. **Best-in-Group Ranked Questions:**
```python
- Sorted by rank
- Includes all question data
- Includes rank and ranking_reasoning
- Includes group metadata
- Formatted as JSON
```
2. **All Grouped Questions:**
```python
- All questions with group metadata
- No ranking information
- Shows which questions are in groups
- Formatted as JSON
```
3. **Formatted Quiz:**
```python
format_quiz_for_ui() creates human-readable format:
**Question 1 [Rank: 2]:** What is...
Ranking Reasoning: ...
• A [Correct]: Option text
  ◦ Feedback: Correct feedback
• B: Option text
  ◦ Feedback: Incorrect feedback
[continues for all questions]
```
### Phase 4: Custom Question Generation (Optional)
**Tab 3 Workflow:**
#### Step 4.1: User Input
User provides:
- Free-form guidance/feedback text
- Model selection
- Temperature setting
#### Step 4.2: Generation
**Process:** `feedback_handlers.propose_question_handler()`
```python
QuizGenerator.generate_multiple_choice_question_from_feedback()
    ↓
quiz_generator/feedback_questions.py
```
**Workflow:**
```python
1. Retrieves processed file contents from state
2. Creates prompt combining:
- User feedback/guidance
- All quality standards
- Course content
- Generation criteria
3. Model generates:
- Single question
- With learning objective inferred from guidance
- 4 options with feedback
- Source references
4. Returns: MultipleChoiceQuestionFromFeedback object
(includes user feedback as metadata)
5. Formatted as JSON for display
```
### Phase 5: Assessment Export (Automated)
The final assessment can be saved using:
```python
QuizGenerator.save_assessment_to_json()
    ↓
quiz_generator/assessment.py → save_assessment_to_json()
```
**Process:**
```python
1. Convert Assessment object to dictionary
assessment_dict = assessment.model_dump()
2. Write to JSON file with indent=2
Default filename: "assessment.json"
3. Contains:
- All learning objectives (best-in-group)
- All ranked questions
- Complete metadata
```
---
## Detailed Component Functionality
### Content Processor (`ui/content_processor.py`)
**Class: `ContentProcessor`**
**Methods:**
1. **`process_files(file_paths: List[str]) -> List[str]`**
- Main entry point for processing multiple files
- Returns list of XML-tagged content strings
- Stores results in `self.file_contents`
2. **`process_file(file_path: str) -> List[str]`**
- Routes to appropriate handler based on file extension
- Returns single-item list with tagged content
3. **`_process_subtitle_file(file_path: str) -> List[str]`**
- Filters out timestamps and metadata
- Preserves actual subtitle text
- Wraps in `<source file='...'>` tags
4. **`_process_notebook_file(file_path: str) -> List[str]`**
- Validates JSON structure
- Parses with nbformat
- Extracts markdown and code cells
- Falls back to raw text on parsing errors
- Wraps in `<source file='...'>` tags
### Learning Objective Generator (`learning_objective_generator/`)
#### **generator.py - LearningObjectiveGenerator Class**
**Orchestrator that delegates to specialized modules:**
**Methods:**
1. **`generate_base_learning_objectives()`**
- Delegates to `base_generation.py`
- Returns base objectives with correct answers
2. **`group_base_learning_objectives()`**
- Delegates to `grouping_and_ranking.py`
- Groups similar objectives
- Identifies best in each group
3. **`generate_incorrect_answer_options()`**
- Delegates to `enhancement.py`
- Adds 5 incorrect answer suggestions per objective
4. **`regenerate_incorrect_answers()`**
- Delegates to `suggestion_improvement.py`
- Quality-checks and improves incorrect answers
5. **`generate_and_group_learning_objectives()`**
- Complete workflow method
- Combines: base generation → grouping → incorrect answers
- Returns dict with all_grouped and best_in_group
#### **base_generation.py**
**Key Functions:**
**`generate_base_learning_objectives()`**
- Wrapper that calls two separate functions
- First: Generate objectives without correct answers
- Second: Generate correct answers for those objectives
**`generate_base_learning_objectives_without_correct_answers()`**
**Process:**
```python
1. Extract source filenames from XML tags
2. Combine all file contents
3. Create prompt with:
- BASE_LEARNING_OBJECTIVES_PROMPT
- BLOOMS_TAXONOMY_LEVELS
- LEARNING_OBJECTIVE_EXAMPLES_WITHOUT_ANSWERS
- Course content
4. API call:
- Model: User-selected
- Temperature: User-selected (if supported)
- Response format: BaseLearningObjectivesWithoutCorrectAnswerResponse
5. Post-process:
- Assign sequential IDs
- Normalize source_reference (extract basenames)
6. Returns: List[BaseLearningObjectiveWithoutCorrectAnswer]
```
**`generate_correct_answers_for_objectives()`**
**Process:**
```python
1. For each objective without answer:
- Create prompt with objective + course content
- Call OpenAI API (text response, not structured)
- Extract correct answer
- Create BaseLearningObjective with answer
2. Error handling: Add "[Error generating correct answer]" on failure
3. Returns: List[BaseLearningObjective]
```
**Quality Guidelines in Prompt:**
- Objectives must be assessable via multiple-choice
- Start with action verbs (identify, describe, define, list, compare)
- One goal per objective
- Derived directly from course content
- Tool/framework agnostic (focus on principles, not specific implementations)
- First objective should be a relatively easy recall question
- Avoid objectives about "building" or "creating" (not MC-assessable)
#### **grouping_and_ranking.py**
**Key Functions:**
**`group_base_learning_objectives()`**
**Process:**
```python
1. Format objectives for display in prompt
2. Create grouping prompt with:
- Original generation criteria
- All base objectives
- Course content
- Grouping instructions
3. Special rule:
- All objectives with IDs ending in 1 grouped together
- Best one selected from this group
- Will become primary objective (ID=1)
4. API call:
- Model: "gpt-5-mini" (hardcoded for efficiency)
- Response format: GroupedBaseLearningObjectivesResponse
5. Post-process:
- Normalize best_in_group to Python bool
- Filter for best-in-group objectives
6. Returns:
{
"all_grouped": List[GroupedBaseLearningObjective],
"best_in_group": List[GroupedBaseLearningObjective]
}
```
**Grouping Criteria:**
- Topic overlap
- Similarity of concepts
- Quality based on original generation criteria
- Clarity and specificity
- Alignment with course content
#### **enhancement.py**
**Key Function: `generate_incorrect_answer_options()`**
**Process:**
```python
1. For each base objective:
- Create prompt with:
- Learning objective and correct answer
- INCORRECT_ANSWER_PROMPT (detailed guidelines)
- INCORRECT_ANSWER_EXAMPLES
- Course content
- Request 5 plausible incorrect options
2. API call:
- Model: model_override or default
- Temperature: User-selected (if supported)
- Response format: LearningObjective (includes incorrect_answer_options)
3. Returns: List[LearningObjective] with all fields populated
```
**Incorrect Answer Quality Principles:**
- Create common misunderstandings
- Maintain identical structure to correct answer
- Use course terminology correctly but in wrong contexts
- Include partially correct information
- Avoid obviously wrong answers
- Mirror detail level and style of correct answer
- Avoid absolute terms ("always", "never", "exclusively")
- Avoid contradictory second clauses
#### **suggestion_improvement.py**
**Key Function: `regenerate_incorrect_answers()`**
**Process:**
```python
1. For each learning objective:
- Call should_regenerate_incorrect_answers()
2. should_regenerate_incorrect_answers():
- Creates evaluation prompt with:
- Objective and all incorrect options
- IMMEDIATE_RED_FLAGS checklist
- RULES_FOR_SECOND_CLAUSES
- LLM evaluates each option
- Returns: needs_regeneration: bool
3. If regeneration needed:
- Logs to incorrect_suggestion_debug/{id}.txt
- Creates new prompt with additional constraints
- Regenerates incorrect answers
- Validates again
4. Returns: List[LearningObjective] with improved incorrect answers
```
**Red Flags Checked:**
- Contradictory second clauses ("but not necessarily")
- Explicit negations ("without automating")
- Opposite descriptions ("fixed steps" for flexible systems)
- Absolute/comparative terms
- Hedging that creates limitations
- Trade-off language creating false dichotomies
### Quiz Generator (`quiz_generator/`)
#### **generator.py - QuizGenerator Class**
**Orchestrator with LearningObjectiveGenerator embedded:**
**Initialization:**
```python
def __init__(self, api_key, model="gpt-5", temperature=1.0):
self.client = OpenAI(api_key=api_key)
self.model = model
self.temperature = temperature
self.learning_objective_generator = LearningObjectiveGenerator(
api_key=api_key, model=model, temperature=temperature
)
```
**Methods (delegates to specialized modules):**
1. **`generate_base_learning_objectives()`** → delegates to LearningObjectiveGenerator
2. **`generate_lo_incorrect_answer_options()`** → delegates to LearningObjectiveGenerator
3. **`group_base_learning_objectives()`** → delegates to grouping_and_ranking.py
4. **`generate_multiple_choice_question()`** → delegates to question_generation.py
5. **`generate_questions_in_parallel()`** → delegates to assessment.py
6. **`group_questions()`** → delegates to question_ranking.py
7. **`rank_questions()`** → delegates to question_ranking.py
8. **`judge_question_quality()`** → delegates to question_improvement.py
9. **`regenerate_incorrect_answers()`** → delegates to question_improvement.py
10. **`generate_multiple_choice_question_from_feedback()`** → delegates to feedback_questions.py
11. **`save_assessment_to_json()`** → delegates to assessment.py
#### **question_generation.py**
**Key Function: `generate_multiple_choice_question()`**
**Detailed Process:**
**1. Source Content Matching:**
```python
source_references = learning_objective.source_reference
if isinstance(source_references, str):
    source_references = [source_references]

combined_content = ""
for source_file in source_references:
    found = False
    # Try exact match on the XML tag: <source file='filename'>
    for file_content in file_contents:
        if f"<source file='{source_file}'>" in file_content:
            combined_content += file_content
            found = True
            break
    # Fallback: partial filename match
    if not found:
        for file_content in file_contents:
            if source_file in file_content:
                combined_content += file_content
                break

# Last resort: use all content
if not combined_content:
    combined_content = "\n\n".join(file_contents)
```
**2. Multi-Source Instruction:**
```python
if len(source_references) > 1:
Add special instruction:
"This learning objective spans multiple sources.
Your question should:
1. Synthesize information across these sources
2. Test understanding of overarching themes
3. Require knowledge from multiple sources"
```
**3. Prompt Construction:**
Combines extensive quality standards:
```python
- Learning objective
- Correct answer
- Incorrect answer options from objective
- GENERAL_QUALITY_STANDARDS
- MULTIPLE_CHOICE_STANDARDS
- EXAMPLE_QUESTIONS
- QUESTION_SPECIFIC_QUALITY_STANDARDS
- CORRECT_ANSWER_SPECIFIC_QUALITY_STANDARDS
- INCORRECT_ANSWER_EXAMPLES_WITH_EXPLANATION
- ANSWER_FEEDBACK_QUALITY_STANDARDS
- Multi-source instruction (if applicable)
- Matched course content
```
**4. API Call:**
```python
params = {
"model": model,
"messages": [
{"role": "system", "content": "Expert educational assessment creator"},
{"role": "user", "content": prompt}
],
"response_format": MultipleChoiceQuestion
}
if not TEMPERATURE_UNAVAILABLE.get(model, True):
params["temperature"] = temperature
response = client.beta.chat.completions.parse(**params)
```
**5. Post-Processing:**
```python
- Set response.id = learning_objective.id
- Set response.learning_objective_id = learning_objective.id
- Set response.learning_objective = learning_objective.learning_objective
- Set response.source_reference = learning_objective.source_reference
- Verify all options have feedback
- Add default feedback if missing
```
**6. Error Handling:**
```python
On exception:
- Create fallback question with 4 generic options
- Include error message in question_text
- Mark as questionable quality
```
#### **question_ranking.py**
**Key Functions:**
**`group_questions(questions, file_contents)`**
**Process:**
```python
1. Create prompt with:
- GROUP_QUESTIONS_PROMPT
- All questions with complete data
- Grouping instructions
2. Grouping Logic:
- Questions with same learning_objective_id are similar
- Group by topic overlap
- Mark best_in_group within each group
- Single-member groups: best_in_group = true by default
3. API call:
- Model: User-selected
- Response format: GroupedMultipleChoiceQuestionsResponse
4. Critical Instructions:
- MUST return ALL questions
- Each question must have group metadata
- best_in_group set appropriately
5. Returns:
{
"grouped": List[GroupedMultipleChoiceQuestion],
"best_in_group": [questions where best_in_group=true]
}
```
**`rank_questions(questions, file_contents)`**
**Process:**
```python
1. Create prompt with:
- RANK_QUESTIONS_PROMPT
- ALL quality standards (comprehensive)
- Best-in-group questions only
- Course content
2. Ranking Criteria (from prompt):
- Question clarity and unambiguity
- Alignment with learning objective
- Quality of incorrect options
- Feedback quality
- Appropriate difficulty (simple English preferred)
- Adherence to all guidelines
- Avoidance of problematic words/phrases
3. Special Instructions:
- DO NOT change question with ID=1
- Rank starting from 2 (rank 1 reserved)
- Each question gets unique rank
- Must return ALL questions
4. API call:
- Model: User-selected
- Response format: RankedMultipleChoiceQuestionsResponse
5. Returns:
{
"ranked": List[RankedMultipleChoiceQuestion]
(includes rank and ranking_reasoning for each)
}
```
**Simple vs Complex English Examples (from ranking criteria):**
```
Simple: "AI engineers create computer programs that can learn from data"
Complex: "AI engineering practitioners architect computational paradigms
exhibiting autonomous erudition capabilities"
```
#### **question_improvement.py**
**Key Functions:**
**`judge_question_quality(client, model, temperature, question)`**
**Process:**
```python
1. Create evaluation prompt with:
- Question text
- All options with feedback
- Quality criteria
- Evaluation instructions
2. LLM evaluates:
- Clarity and lack of ambiguity
- Alignment with learning objective
- Quality of distractors (incorrect options)
- Feedback quality and helpfulness
- Appropriate difficulty level
- Adherence to all standards
3. API call:
- Unstructured text response
- LLM returns: APPROVED or NOT APPROVED + reasoning
4. Parsing:
approved = "APPROVED" in response.upper()
feedback = full response text
5. Returns: (approved: bool, feedback: str)
```
**`should_regenerate_incorrect_answers(client, question, file_contents, model_name)`**
**Process:**
```python
1. Extract incorrect options from question
2. Create evaluation prompt with:
- Each incorrect option
- IMMEDIATE_RED_FLAGS checklist
- Course content for context
3. LLM checks each option for:
- Contradictory second clauses
- Explicit negations
- Absolute terms
- Opposite descriptions
- Trade-off language
4. Returns: needs_regeneration: bool
5. If true:
- Log to wrong_answer_debug/ directory
- Provides detailed feedback on issues
```
**`regenerate_incorrect_answers(client, model, temperature, questions, file_contents)`**
**Process:**
```python
1. For each question:
- Check if regeneration needed
- If yes:
a. Create new prompt with stricter constraints
b. Include original question for context
c. Add specific rules about avoiding red flags
d. Regenerate options
e. Validate again
- If no: keep original
2. Returns: List of questions with improved incorrect answers
```
#### **feedback_questions.py**
**Key Function: `generate_multiple_choice_question_from_feedback()`**
**Process:**
```python
1. Accept user feedback/guidance as free-form text
2. Create prompt combining:
- User feedback
- All quality standards
- Course content
- Standard generation criteria
3. LLM infers:
- Learning objective from feedback
- Appropriate question
- 4 options with feedback
- Source references
4. API call:
- Model: User-selected
- Response format: MultipleChoiceQuestionFromFeedback
5. Includes user feedback as metadata in response
6. Returns: Single question object
```
#### **assessment.py**
**Key Functions:**
**`generate_questions_in_parallel()`**
**Parallel Processing Details:**
```python
1. Setup:
max_workers = min(len(learning_objectives), 5)
# Limits to 5 concurrent threads
2. Thread Pool Executor:
with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
3. For each objective (in separate thread):
Worker function:
def generate_question_for_objective(objective, idx):
- Generate question
- Judge quality
- Update with approval and feedback
- Handle errors gracefully
- Return complete question
4. Submit all tasks:
future_to_idx = {
executor.submit(generate_question_for_objective, obj, i): i
for i, obj in enumerate(learning_objectives)
}
5. Collect results as completed:
for future in concurrent.futures.as_completed(future_to_idx):
question = future.result()
questions.append(question)
print progress
6. Error handling:
- Individual failures don't stop other threads
- Placeholder questions created on error
- All errors logged
7. Returns: List[MultipleChoiceQuestion] with quality judgments
```
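A condensed, runnable sketch of this pattern; the worker body is simplified and the function parameters are placeholders for the real generation and judging calls:

```python
import concurrent.futures

def generate_questions_in_parallel(learning_objectives, generate_fn, judge_fn):
    """Generate and judge one question per objective using a bounded thread pool."""
    max_workers = min(len(learning_objectives), 5)   # cap concurrency at 5 threads
    questions = []

    def worker(objective):
        question = generate_fn(objective)            # LLM question generation
        approved, feedback = judge_fn(question)      # LLM quality judgment
        question.approved = approved
        question.judge_feedback = feedback
        return question

    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(worker, obj) for obj in learning_objectives]
        for future in concurrent.futures.as_completed(futures):
            try:
                questions.append(future.result())
            except Exception as exc:                 # one failure should not stop the others
                print(f"Question generation failed: {exc}")
    return questions
```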
**`save_assessment_to_json(assessment, output_path)`**
```python
1. Convert Pydantic model to dict:
assessment_dict = assessment.model_dump()
2. Write to JSON file:
with open(output_path, "w") as f:
json.dump(assessment_dict, f, indent=2)
3. File contains:
{
"learning_objectives": [...],
"questions": [...]
}
```
### State Management (`ui/state.py`)
**Global State Variables:**
```python
processed_file_contents = [] # List of XML-tagged content strings
generated_learning_objectives = [] # List of learning objective objects
```
**Functions:**
- `get_processed_contents()` → retrieves file contents
- `set_processed_contents(contents)` → stores file contents
- `get_learning_objectives()` → retrieves objectives
- `set_learning_objectives(objectives)` → stores objectives
- `clear_state()` → resets both variables
**Purpose:**
- Persists data between UI tabs
- Allows Tab 2 to access content processed in Tab 1
- Allows Tab 3 to access content for custom questions
- Enables regeneration with feedback
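Because the module is small, a near-complete approximation based on the descriptions above fits here (the real file may differ in details):

```python
# Module-level state shared across the Gradio tabs
processed_file_contents = []          # XML-tagged content strings
generated_learning_objectives = []    # learning objective objects

def get_processed_contents():
    return processed_file_contents

def set_processed_contents(contents):
    global processed_file_contents
    processed_file_contents = contents

def get_learning_objectives():
    return generated_learning_objectives

def set_learning_objectives(objectives):
    global generated_learning_objectives
    generated_learning_objectives = objectives

def clear_state():
    """Reset both state variables."""
    global processed_file_contents, generated_learning_objectives
    processed_file_contents = []
    generated_learning_objectives = []
```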
### UI Handlers
#### **objective_handlers.py**
**`process_files(files, num_objectives, num_runs, model_name, incorrect_answer_model_name, temperature)`**
**Complete Workflow:**
```python
1. Validate inputs (files exist, API key present)
2. Extract file paths from Gradio file objects
3. Process files → get XML-tagged content
4. Store in state
5. Create QuizGenerator
6. Generate multiple runs of base objectives
7. Group and rank objectives
8. Generate incorrect answers for best-in-group
9. Improve incorrect answers
10. Reassign IDs (best from 001 group → ID=1)
11. Format results for display
12. Store in state
13. Return 4 outputs: status, best-in-group, all-grouped, raw
```
**`regenerate_objectives(objectives_json, feedback, num_objectives, num_runs, model_name, temperature)`**
**Workflow:**
```python
1. Retrieve processed contents from state
2. Append feedback to content:
file_contents_with_feedback.append(f"FEEDBACK: {feedback}")
3. Generate new objectives with feedback context
4. Group and rank
5. Return regenerated objectives
```
**`_reassign_objective_ids(grouped_objectives)`**
**ID Assignment Logic:**
```python
1. Find all objectives with IDs ending in 001 (1001, 2001, etc.)
2. Identify their groups
3. Find best_in_group objective from these groups
4. Assign it ID = 1
5. Assign all other objectives sequential IDs starting from 2
```
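An illustrative implementation of that logic (assumes each objective exposes `id` and `best_in_group`; not the project's exact code):

```python
def reassign_objective_ids(grouped_objectives):
    """Give the best objective from the 001 group ID 1; number the rest from 2."""
    primary = next(
        (o for o in grouped_objectives if o.id % 1000 == 1 and o.best_in_group),
        None,  # if no 001-group winner exists, every objective is numbered from 2
    )
    next_id = 2
    for objective in grouped_objectives:
        if objective is primary:
            objective.id = 1
        else:
            objective.id = next_id
            next_id += 1
    return grouped_objectives
```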
**`_format_objective_results(grouped_result, all_learning_objectives)`**
**Formatting:**
```python
1. Sort by ID
2. Create dictionaries from Pydantic objects
3. Include all metadata fields
4. Convert to JSON with indent=2
5. Return 3 formatted outputs + status message
```
#### **question_handlers.py**
**`generate_questions(objectives_json, model_name, temperature, num_runs)`**
**Complete Workflow:**
```python
1. Validate inputs
2. Parse objectives JSON → create LearningObjective objects
3. Retrieve processed contents from state
4. Create QuizGenerator
5. Generate questions (multiple runs in parallel)
6. Group questions by similarity
7. Rank best-in-group questions
8. Optionally improve incorrect answers (currently commented out)
9. Format results
10. Return 4 outputs: status, best-ranked, all-grouped, formatted
```
**`_generate_questions_multiple_runs()`**
```python
For each run:
1. Call generate_questions_in_parallel()
2. Assign unique IDs across runs:
start_id = len(all_questions) + 1
for i, q in enumerate(run_questions):
q.id = start_id + i
3. Aggregate all questions
```
**`_group_and_rank_questions()`**
```python
1. Group all questions → get grouped and best_in_group
2. Rank only best_in_group questions
3. Return:
{
"grouped": all with group metadata,
"best_in_group_ranked": best with ranks
}
```
#### **feedback_handlers.py**
**`propose_question_handler(guidance, model_name, temperature)`**
**Workflow:**
```python
1. Validate state (processed contents available)
2. Create QuizGenerator
3. Call generate_multiple_choice_question_from_feedback()
- Passes user guidance and course content
- LLM infers learning objective
- Generates complete question
4. Format as JSON
5. Return status and question JSON
```
### Formatting Utilities (`ui/formatting.py`)
**`format_quiz_for_ui(questions_json)`**
**Process:**
```python
1. Parse JSON to list of question dictionaries
2. Sort by rank if available
3. For each question:
- Add header: "**Question N [Rank: X]:** {question_text}"
- Add ranking reasoning if available
- For each option:
- Add letter (A, B, C, D)
- Mark correct option
- Include option text
- Include feedback indented
4. Return formatted string with markdown
```
**Output Example:**
```
**Question 1 [Rank: 2]:** What is the primary purpose of AI agents?
Ranking Reasoning: Clear question that tests fundamental understanding...
• A [Correct]: To automate tasks and make decisions
  ◦ Feedback: Correct! AI agents are designed to automate tasks...
• B: To replace human workers entirely
  ◦ Feedback: While AI agents can automate tasks, they are not...
[continues...]
```
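A compact sketch of a formatter with this behavior; the field names follow the question JSON described earlier, but the code is illustrative rather than the project's exact implementation:

```python
import json

def format_quiz_for_ui(questions_json: str) -> str:
    """Render ranked questions as markdown with correct-answer markers and feedback."""
    questions = json.loads(questions_json)
    questions.sort(key=lambda q: q.get("rank", 0))
    lines = []
    for number, q in enumerate(questions, start=1):
        rank = q.get("rank")
        rank_part = f" [Rank: {rank}]" if rank is not None else ""
        lines.append(f"**Question {number}{rank_part}:** {q['question_text']}")
        if q.get("ranking_reasoning"):
            lines.append(f"Ranking Reasoning: {q['ranking_reasoning']}")
        for letter, option in zip("ABCD", q["options"]):
            marker = " [Correct]" if option["is_correct"] else ""
            lines.append(f"• {letter}{marker}: {option['option_text']}")
            lines.append(f"  ◦ Feedback: {option['feedback']}")
        lines.append("")
    return "\n".join(lines)
```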
---
## Quality Standards and Prompts
### Learning Objectives Quality Standards
**From `prompts/learning_objectives.py`:**
**BASE_LEARNING_OBJECTIVES_PROMPT - Key Requirements:**
1. **Assessability:**
- Must be testable via multiple-choice questions
- Cannot be about "building", "creating", "developing"
- Should use verbs like: identify, list, describe, define, compare
2. **Specificity:**
- One goal per objective
- Don't combine multiple action verbs
- Example of what NOT to do: "identify X and explain Y"
3. **Source Alignment:**
- Derived DIRECTLY from course content
- No topics not covered in content
- Appropriate difficulty level for course
4. **Independence:**
- Each objective stands alone
- No dependencies on other objectives
- No context required from other objectives
5. **Focus:**
- Address "why" over "what" when possible
- Critical knowledge over trivial facts
- Principles over specific implementation details
6. **Tool/Framework Agnosticism:**
- Don't mention specific tools/frameworks
- Focus on underlying principles
- Example: Don't ask about "Pandas DataFrame methods",
ask about "data filtering concepts"
7. **First Objective Rule:**
- Should be a relatively easy recall question
- Address main topic/concept of course
- Format: "Identify what X is" or "Explain why X is important"
8. **Answer Length:**
- Aim for ≤20 words in correct answer
- Avoid unnecessary elaboration
- No compound sentences with extra consequences
**BLOOMS_TAXONOMY_LEVELS:**
Levels from lowest to highest:
- **Recall:** Retention of key concepts (not trivialities)
- **Comprehension:** Connect ideas, demonstrate understanding
- **Application:** Apply concept to new but similar scenario
- **Analysis:** Examine parts, determine relationships, make inferences
- **Evaluation:** Make judgments requiring critical thinking
**LEARNING_OBJECTIVE_EXAMPLES:**
Includes 7 high-quality examples with:
- Appropriate action verbs
- Clear learning objectives
- Concise correct answers (mostly <20 words)
- Multiple source references
- Framework-agnostic language
### Question Quality Standards
**From `prompts/questions.py`:**
**GENERAL_QUALITY_STANDARDS:**
- Overall goal: Set learner up for success
- Perfect score attainable for thoughtful students
- Aligned with course content
- Aligned with learning objective and correct answer
- No references to manual intervention (software/AI course)
**MULTIPLE_CHOICE_STANDARDS:**
- **EXACTLY ONE** correct answer per question
- Clear, unambiguous correct answer
- Plausible distractors representing common misconceptions
- Not obviously wrong distractors
- All options similar length and detail
- Mutually exclusive options
- Avoid "all/none of the above"
- Typically 4 options (A, B, C, D)
- Don't start feedback with "Correct" or "Incorrect"
**QUESTION_SPECIFIC_QUALITY_STANDARDS:**
Questions must:
- Match language and tone of course
- Match difficulty level of course
- Assess only course information
- Not teach as part of quiz
- Use clear, concise language
- Not induce confusion
- Provide slight (not major) challenge
- Be easily interpreted and unambiguous
- Have proper grammar and sentence structure
- Be thoughtful and specific (not broad and ambiguous)
- Be complete in wording (understanding the question shouldn't be part of the assessment)
**CORRECT_ANSWER_SPECIFIC_QUALITY_STANDARDS:**
Correct answers must:
- Be factually correct and unambiguous
- Match course language and tone
- Be complete sentences
- Match course difficulty level
- Contain only course information
- Not teach during quiz
- Use clear, concise language
- Be thoughtful and specific
- Be complete (identifying the correct answer shouldn't require interpretation)
**INCORRECT_ANSWER_SPECIFIC_QUALITY_STANDARDS:**
Incorrect answers should:
- Represent reasonable potential misconceptions
- Sound plausible to non-experts
- Require thought even from diligent learners
- Not be obviously wrong
- Use incorrect_answer_suggestions from objective (as starting point)
**Avoid:**
- Obviously wrong options anyone can eliminate
- Absolute terms: "always", "never", "only", "exclusively"
- Phrases like "used exclusively for scenarios where..."
**ANSWER_FEEDBACK_QUALITY_STANDARDS:**
**For Incorrect Answers:**
- Be informational and encouraging (not punitive)
- Single sentence, concise
- Do NOT say "Incorrect" or "Wrong"
**For Correct Answers:**
- Be informational and encouraging
- Single sentence, concise
- Do NOT say "Correct!" (redundant after "Correct: " prefix)
### Incorrect Answer Generation Guidelines
**From `prompts/incorrect_answers.py`:**
**Core Principles:**
1. **Create Common Misunderstandings:**
- Represent how students actually misunderstand
- Confuse related concepts
- Mix up terminology
2. **Maintain Identical Structure:**
- Match grammatical pattern of correct answer
- Same length and complexity
- Same formatting style
3. **Use Course Terminology Correctly but in Wrong Contexts:**
- Apply correct terms incorrectly
- Confuse with related concepts
- Example: an option that appears to describe backpropagation but actually describes forward propagation
4. **Include Partially Correct Information:**
- First part correct, second part wrong
- Correct process but wrong application
- Correct concept but incomplete
5. **Avoid Obviously Wrong Answers:**
- No contradictions with basic knowledge
- Not immediately eliminable
- Require course knowledge to reject
6. **Mirror Detail Level and Style:**
- Match technical depth
- Match tone
- Same level of specificity
7. **For Lists, Maintain Consistency:**
- Same number of items
- Same format
- Mix some correct with incorrect items
8. **AVOID ABSOLUTE TERMS:**
- "always", "never", "exclusively", "primarily"
- "all", "every", "none", "nothing", "only"
- "must", "required", "impossible"
- "rather than", "as opposed to", "instead of"
**IMMEDIATE_RED_FLAGS** (triggers regeneration):
**Contradictory Second Clauses:**
- "but not necessarily"
- "at the expense of"
- "rather than [core concept]"
- "ensuring X rather than Y"
- "without necessarily"
- "but has no impact on"
- "but cannot", "but prevents", "but limits"
**Explicit Negations:**
- "without automating", "without incorporating"
- "preventing [main benefit]"
- "limiting [main capability]"
**Opposite Descriptions:**
- "fixed steps" (for flexible systems)
- "manual intervention" (for automation)
- "simple question answering" (for complex processing)
**Hedging Creating Limitations:**
- "sometimes", "occasionally", "might"
- "to some extent", "partially", "somewhat"
**INCORRECT_ANSWER_EXAMPLES:**
Includes 10 detailed examples showing:
- Learning objective
- Correct answer
- 3 plausible incorrect suggestions
- Explanation of why each is plausible but wrong
- Consistent formatting across all options
### Ranking and Grouping
**RANK_QUESTIONS_PROMPT:**
**Criteria:**
1. Question clarity and unambiguity
2. Alignment with learning objective
3. Quality of incorrect options
4. Quality of feedback
5. Appropriate difficulty (simple English preferred)
6. Adherence to all guidelines
**Critical Instructions:**
- DO NOT change question with ID=1
- Rank starting from 2
- Each question unique rank
- Must return ALL questions
- No omissions
- No duplicate ranks
**Simple vs Complex English:**
```
Simple: "AI engineers create computer programs that learn from data"
Complex: "AI engineering practitioners architect computational paradigms
exhibiting autonomous erudition capabilities"
```
**GROUP_QUESTIONS_PROMPT:**
**Grouping Logic:**
- Questions with same learning_objective_id are similar
- Identify topic overlap
- Mark best_in_group within each group
- Single-member groups: best_in_group = true
**Critical Instructions:**
- Must return ALL questions
- Each question needs group metadata
- No omissions
- Best in group marked appropriately
---
## Summary of Data Flow
### Complete End-to-End Flow
```
User Uploads Files
        ↓
ContentProcessor extracts and tags content
        ↓
[Stored in global state]
        ↓
Generate Base Objectives (multiple runs)
        ↓
Group Base Objectives (by similarity)
        ↓
Generate Incorrect Answers (for best-in-group only)
        ↓
Improve Incorrect Answers (quality check)
        ↓
Reassign IDs (best from 001 group → ID=1)
        ↓
[Objectives displayed in UI, stored in state]
        ↓
Generate Questions (parallel, multiple runs)
        ↓
Judge Question Quality (parallel)
        ↓
Group Questions (by similarity)
        ↓
Rank Questions (best-in-group only)
        ↓
[Questions displayed in UI]
        ↓
Format for Display
        ↓
Export to JSON (optional)
```
### Key Optimization Strategies
1. **Multiple Generation Runs:**
- Generates variety of objectives/questions
- Grouping identifies best versions
- Reduces risk of poor quality individual outputs
2. **Hierarchical Processing:**
- Generate base → Group → Enhance → Improve
- Only enhances best candidates (saves API calls)
- Progressive refinement
3. **Parallel Processing:**
- Questions generated concurrently (up to 5 threads)
- Significant time savings for multiple objectives
- Independent evaluations
4. **Quality Gating:**
- LLM judges question quality
- Checks for red flags in incorrect answers
- Regenerates problematic content
5. **Source Tracking:**
- XML tags preserve origin
- Questions link back to source materials
- Enables accurate content matching
6. **Modular Prompts:**
- Reusable quality standards
- Consistent across all generations
- Easy to update centrally
---
## Configuration and Customization
### Available Models
**Configured in `models/config.py`:**
```python
MODELS = [
"o3-mini", "o1", # Reasoning models (no temperature)
"gpt-4.1", "gpt-4o", # GPT-4 variants
"gpt-4o-mini", "gpt-4",
"gpt-3.5-turbo", # Legacy
"gpt-5", # Latest (no temperature)
"gpt-5-mini", # Efficient (no temperature)
"gpt-5-nano" # Ultra-efficient (no temperature)
]
```
**Temperature Support:**
- Models with reasoning (o1, o3-mini, gpt-5 variants): No temperature
- Other models: Temperature 0.0 to 1.0
**Model Selection Strategy:**
- **Base objectives:** User-selected (default: gpt-5)
- **Grouping:** Hardcoded gpt-5-mini (efficiency)
- **Incorrect answers:** Separate user selection (default: gpt-5)
- **Questions:** User-selected (default: gpt-5)
- **Quality judging:** User-selected or gpt-5-mini
### Environment Variables
**Required:**
```
OPENAI_API_KEY=your_api_key_here
```
**Configured via `.env` file in project root**
### Customization Points
1. **Quality Standards:**
- Edit `prompts/learning_objectives.py`
- Edit `prompts/questions.py`
- Edit `prompts/incorrect_answers.py`
- Changes apply to all future generations
2. **Example Questions/Objectives:**
- Modify LEARNING_OBJECTIVE_EXAMPLES
- Modify EXAMPLE_QUESTIONS
- Modify INCORRECT_ANSWER_EXAMPLES
- LLM learns from these examples
3. **Generation Parameters:**
- Number of objectives per run
- Number of runs (variety)
- Temperature (creativity vs consistency)
- Model selection (quality vs cost/speed)
4. **Parallel Processing:**
- `max_workers` in assessment.py
- Currently: min(len(objectives), 5)
- Adjust for your rate limits
5. **Output Formats:**
- Modify `formatting.py` for display
- Assessment JSON structure in `models/assessment.py`
---
## Error Handling and Resilience
### Content Processing Errors
- **Invalid JSON notebooks:** Falls back to raw text
- **Parse failures:** Wraps in code blocks, continues
- **Missing files:** Logged, skipped
- **Encoding issues:** UTF-8 fallback
### Generation Errors
- **API failures:** Logged with traceback
- **Structured output parse errors:** Fallback responses created
- **Missing required fields:** Default values assigned
- **Validation errors:** Caught and logged
### Parallel Processing Errors
- **Individual thread failures:** Don't stop other threads
- **Placeholder questions:** Created on error
- **Complete error details:** Logged for debugging
- **Graceful degradation:** Partial results returned
### Quality Check Failures
- **Regeneration failures:** Original kept with warning
- **Judge unavailable:** Questions marked unapproved
- **Validation failures:** Detailed logs in debug directories
---
## Debug and Logging
### Debug Directories
1. **`incorrect_suggestion_debug/`**
- Created during objective enhancement
- Contains logs of problematic incorrect answers
- Format: `{objective_id}.txt`
- Includes: Original suggestions, identified issues, regeneration attempts
2. **`wrong_answer_debug/`**
- Created during question improvement
- Logs question-level incorrect answer issues
- Regeneration history
### Console Logging
**Extensive logging throughout:**
- File processing status
- Generation progress (run numbers)
- Parallel thread activity (thread IDs)
- API call results
- Error messages with tracebacks
- Timing information (start/end times)
**Example Log Output:**
```
DEBUG - Processing 3 files: ['file1.vtt', 'file2.ipynb', 'file3.srt']
DEBUG - Found source file: file1.vtt
Generating 3 learning objectives from 3 files
Successfully generated 3 learning objectives without correct answers
Generated correct answer for objective 1
Grouping 9 base learning objectives
Received 9 grouped results
Generating incorrect answer options only for best-in-group objectives...
PARALLEL: Starting ThreadPoolExecutor with 3 workers
PARALLEL: Worker 1 (Thread ID: 12345): Starting work on objective...
Question generation completed in 45.23 seconds
```
---
## Performance Considerations
### API Call Optimization
**Calls per Workflow:**
For 3 objectives × 3 runs = 9 base objectives:
1. **Learning Objectives:**
- Base generation: 3 calls (one per run)
- Correct answers: 9 calls (one per objective)
- Grouping: 1 call
- Incorrect answers: ~3 calls (best-in-group only)
- Improvement checks: ~3 calls
- **Total: ~19 calls**
2. **Questions (for 3 objectives × 1 run):**
- Question generation: 3 calls (parallel)
- Quality judging: 3 calls (parallel)
- Grouping: 1 call
- Ranking: 1 call
- **Total: ~8 calls**
**Total for complete workflow: ~27 API calls**
### Time Estimates
**Typical Execution Times:**
- File processing: <1 second
- Objective generation (3×3): 30-60 seconds
- Question generation (3×1): 20-40 seconds (with parallelization)
- **Total: 1-2 minutes for a small course**
**Factors Affecting Speed:**
- Model selection (gpt-5 slower than gpt-5-mini)
- Number of runs
- Number of objectives/questions
- API rate limits
- Network latency
- Parallel worker count
### Cost Optimization
**Strategies:**
1. Use gpt-5-mini for grouping/ranking (hardcoded)
2. Reduce number of runs (trade-off: variety)
3. Generate fewer objectives initially
4. Use faster models for initial exploration
5. Use premium models for final production
---
## Conclusion
The AI Course Assessment Generator is a sophisticated, multi-stage system that transforms raw course materials into high-quality educational assessments. It employs:
- **Modular architecture** for maintainability
- **Structured output generation** for reliability
- **Quality-driven iterative refinement** for excellence
- **Parallel processing** for efficiency
- **Comprehensive error handling** for resilience
The system successfully balances automation with quality control, producing assessments that align with educational best practices and Bloom's Taxonomy while maintaining complete traceability to source materials.