# Summarization Module

## Responsibility
This module handles **text summarization and conversion to structured study notes**.

## Functionality
1. Receive transcribed text from videos.
2. Use a **local mT5 model** (map-reduce pipeline) to analyze text and generate structured JSON notes.
3. Produce clean Markdown output with:
   - Source & Duration header
   - Overall Summary
   - Chronological Timeline (3-7 segments with Key Insight + Why It Matters)
   - Conclusion
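
The map-reduce flow in step 2 can be sketched roughly as below. This is a minimal illustration, not the module's actual code: `chunk_words`, `map_reduce_summarize`, and the `summarize_fn` parameter are hypothetical names, and the stand-in summarizer simply keeps the first sentence so the example runs without loading mT5.

```python
from typing import Callable, List


def chunk_words(text: str, chunk_size: int) -> List[str]:
    """Split text into chunks of at most chunk_size words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


def map_reduce_summarize(
    text: str,
    summarize_fn: Callable[[str], str],
    chunk_size: int = 300,  # assumed value, not taken from the module
) -> str:
    """Summarize each chunk (map), then summarize the joined partials (reduce)."""
    chunks = chunk_words(text, chunk_size)
    if not chunks:
        return ""
    partials = [summarize_fn(chunk) for chunk in chunks]  # map step
    if len(partials) == 1:
        return partials[0]
    return summarize_fn(" ".join(partials))  # reduce step


def first_sentence(text: str) -> str:
    """Stand-in for a model call: keep only the first sentence."""
    head = text.split(".")[0].strip()
    return head + "." if head else ""


if __name__ == "__main__":
    transcript = "Python is readable. It has many libraries. Functions group code."
    print(map_reduce_summarize(transcript, first_sentence, chunk_size=6))
```

Passing the summarizer in as a callable keeps the chunking and reduce logic testable on its own; in the real pipeline that callable would presumably wrap the mT5 model.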

## Files

### 1. `schemas.py`
- **Purpose:** Single source of truth for all Pydantic data models.
- **Key Classes:**
  - `SummarySchema`: Full structured output (title, detected_language, summary, segments, conclusion, topics).
  - `SegmentSchema`: A timeline section (title, summary, key_insight, why_it_matters).
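
A minimal sketch of what these two models might look like, assuming plain string fields and a `topics` list that defaults to empty. The field names come from the list above; the types and the default are assumptions, and the real `schemas.py` may differ.

```python
from typing import List

from pydantic import BaseModel, Field, ValidationError


class SegmentSchema(BaseModel):
    """One timeline section of the notes."""

    title: str
    summary: str
    key_insight: str
    why_it_matters: str


class SummarySchema(BaseModel):
    """Full structured output for one transcript."""

    title: str
    detected_language: str
    summary: str
    segments: List[SegmentSchema]
    conclusion: str
    topics: List[str] = Field(default_factory=list)  # assumed to be optional


if __name__ == "__main__":
    raw = {
        "title": "Introduction to Python",
        "detected_language": "English",
        "summary": "An overview of the basics.",
        "segments": [
            {
                "title": "Variables",
                "summary": "How names bind to values.",
                "key_insight": "Names are references.",
                "why_it_matters": "It explains aliasing.",
            }
        ],
        "conclusion": "Start small and practice.",
    }
    notes = SummarySchema(**raw)  # nested dicts are validated as SegmentSchema
    print(notes.segments[0].key_insight)

    try:
        SummarySchema(title="Missing fields")
    except ValidationError:
        print("Incomplete notes were rejected")
```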

### 2. `note_generator.py`
- **Purpose:** Generate notes using a local mT5 model with chunk-based map-reduce and schema validation.
- **Main Class:** `NoteGenerator`
- **Key Methods:**
  - `generateSummary(transcript, title)`: Generates structured JSON study notes.
  - `format_notes_to_markdown(json_notes)`: Converts JSON to clean Markdown.
  - `format_final_notes(notes, title, url, duration)`: Wraps Markdown with Source/Duration header.

### 3. `segmenter.py`
- **Purpose:** Split long texts into smaller segments for preprocessing.
- **Main Class:** `TranscriptSegmenter`
- **Key Methods:**
  - `segment_text_by_words()`: Split text into fixed-size word chunks (used by the mT5 pipeline).
  - `segment_by_time()`: Split by time intervals.
  - `segment_by_topic()`: Split by paragraph/topic boundaries.
  - `clean_text()`: Remove filler words.
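
To make the cleaning and paragraph-splitting steps concrete, here is a small standalone sketch. It does not reproduce `TranscriptSegmenter`: the filler list, the regular expressions, and the function `split_paragraphs` are assumptions for illustration, and this `clean_text` is a free function rather than the class method.

```python
import re
from typing import List

# Hypothetical filler list; the module's real list is not shown in this README.
FILLER_WORDS = ("um", "uh", "erm")

# Match a filler as a whole word, plus an optional trailing comma.
_FILLER_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(word) for word in FILLER_WORDS) + r")\b,?",
    flags=re.IGNORECASE,
)


def clean_text(text: str) -> str:
    """Remove standalone filler words and collapse runs of whitespace."""
    without_fillers = _FILLER_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", without_fillers).strip()


def split_paragraphs(text: str) -> List[str]:
    """Split on blank lines, dropping empty paragraphs."""
    parts = re.split(r"\n\s*\n", text)
    return [part.strip() for part in parts if part.strip()]


if __name__ == "__main__":
    raw = "Um, so this is uh the intro.\n\nErm, and this is the second part."
    # Split first: clean_text collapses newlines, which would erase boundaries.
    for paragraph in split_paragraphs(raw):
        print(clean_text(paragraph))
```

The word-boundary anchors keep words such as "umbrella" intact while still removing a bare "um".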

## JSON Output Structure
```json
{
    "title": "...",
    "detected_language": "English",
    "summary": "Overall summary (3-5 sentences)",
    "segments": [
        {
            "title": "Segment title",
            "summary": "What this section covers",
            "key_insight": "Most important point",
            "why_it_matters": "Why this is valuable"
        }
    ],
    "conclusion": "Final takeaway",
    "topics": ["Topic1", "Topic2"]
}
```

> **Note:** `topics` is hidden metadata: not rendered in markdown, used by downstream modules only.

## Markdown Output Order
1. **Source**: video URL
2. **Duration**: video length
3. **Overall Summary**: one concise summary
4. **Timeline**: chronological segments (3-7), each with Key Insight + Why It Matters
5. **Conclusion**: final takeaway
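
One way this ordering could be rendered is sketched below. The function name `render_notes`, the heading levels, and the bullet layout are assumptions; only the section order and the omission of `topics` follow this README.

```python
from typing import Any, Dict, List


def render_notes(notes: Dict[str, Any], url: str, duration: str) -> str:
    """Render notes as Markdown in the documented section order."""
    lines: List[str] = [
        f"**Source:** {url}",
        f"**Duration:** {duration}",
        "",
        "## Overall Summary",
        notes["summary"],
        "",
        "## Timeline",
    ]
    for index, segment in enumerate(notes["segments"], start=1):
        lines += [
            f"### {index}. {segment['title']}",
            segment["summary"],
            f"- **Key Insight:** {segment['key_insight']}",
            f"- **Why It Matters:** {segment['why_it_matters']}",
            "",
        ]
    lines += ["## Conclusion", notes["conclusion"]]
    # "topics" is deliberately skipped: it is metadata for downstream modules.
    return "\n".join(lines)


if __name__ == "__main__":
    example = {
        "summary": "An overview of the basics.",
        "segments": [
            {
                "title": "Variables",
                "summary": "How names bind to values.",
                "key_insight": "Names are references.",
                "why_it_matters": "It explains aliasing.",
            }
        ],
        "conclusion": "Start small and practice.",
        "topics": ["Python"],
    }
    print(render_notes(example, "https://example.com/video", "12:34"))
```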

## Labels (Localized)
| Key | English | Arabic |
|-----|---------|--------|
| source | Source | المصدر |
| duration | Duration | المدة |
| summary | Overall Summary | الملخص العام |
| timeline | Timeline | التسلسل الزمني |
| insight | Key Insight | أهم نقطة |
| why | Why It Matters | لماذا يهم؟ |
| conclusion | Conclusion | الخلاصة |
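
A label table like this maps naturally onto a nested dictionary. The sketch below is illustrative only: `LABELS`, `get_label`, keying by the `detected_language` value, and the fallback to English for unknown languages are all assumptions, not the module's confirmed behavior.

```python
from typing import Dict

# Label text per language, keyed by the values in the "Key" column.
LABELS: Dict[str, Dict[str, str]] = {
    "English": {
        "source": "Source",
        "duration": "Duration",
        "summary": "Overall Summary",
        "timeline": "Timeline",
        "insight": "Key Insight",
        "why": "Why It Matters",
        "conclusion": "Conclusion",
    },
    "Arabic": {
        "source": "المصدر",
        "duration": "المدة",
        "summary": "الملخص العام",
        "timeline": "التسلسل الزمني",
        "insight": "أهم نقطة",
        "why": "لماذا يهم؟",
        "conclusion": "الخلاصة",
    },
}


def get_label(key: str, language: str) -> str:
    """Return the label for key, falling back to English (an assumption)."""
    return LABELS.get(language, LABELS["English"])[key]


if __name__ == "__main__":
    print(get_label("timeline", "Arabic"))
    print(get_label("timeline", "French"))  # unknown language, English fallback
```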

## Testing
```python
from src.summarization.note_generator import NoteGenerator

generator = NoteGenerator()
transcript = "Here is the complete video transcript..."
title = "Introduction to Python"

# Generate notes
summary_json = generator.generateSummary(transcript, title)
notes_md = generator.format_notes_to_markdown(summary_json)

print(notes_md)
```

## Libraries Used
- `transformers`: Load and run the local mT5 model (HuggingFace).
- `sentencepiece`: Tokenizer backend required by mT5.
- `langdetect`: Automatic language detection for multilingual support.
- `torch`: PyTorch runtime for model inference.
- `pydantic`: Data validation and schema enforcement.

## Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `MT5_MODEL_NAME` | `google/mt5-small` | HuggingFace model ID to load |
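
As a rough sketch of how the variable could be resolved at startup: `resolve_model_name` is a hypothetical helper, and treating an empty value as unset is an assumption rather than documented behavior.

```python
import os

DEFAULT_MODEL_NAME = "google/mt5-small"


def resolve_model_name() -> str:
    """Read MT5_MODEL_NAME, using the default when it is unset or empty."""
    return os.environ.get("MT5_MODEL_NAME") or DEFAULT_MODEL_NAME


if __name__ == "__main__":
    print(resolve_model_name())
```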