# Summarization Module 📝
## Responsibility
This module handles **text summarization and conversion to structured study notes**.
## Functionality
1. Receive transcribed text from videos.
2. Use a **local mT5 model** (map-reduce pipeline) to analyze text and generate structured JSON notes.
3. Produce clean Markdown output with:
   - Source & Duration header
   - Overall Summary
   - Chronological Timeline (3-7 segments with Key Insight + Why It Matters)
   - Conclusion
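
The map-reduce step above can be sketched as follows. This is a minimal illustration, not the module's actual code: `map_reduce_summarize` and `first_sentence` are hypothetical names, and `first_sentence` is only a stand-in for the real mT5 call so the example runs without a model.

```python
from typing import Callable, List


def map_reduce_summarize(chunks: List[str], summarize: Callable[[str], str]) -> str:
    """Summarize each chunk (map), then summarize the joined partials (reduce)."""
    if not chunks:
        return ""
    # Map step: one partial summary per chunk.
    partials = [summarize(chunk) for chunk in chunks]
    if len(partials) == 1:
        return partials[0]
    # Reduce step: condense the concatenated partial summaries once more.
    return summarize(" ".join(partials))


def first_sentence(text: str) -> str:
    """Hypothetical stand-in for the mT5 call: keep only the first sentence."""
    head = text.split(".")[0].strip()
    return head + "." if head else ""


if __name__ == "__main__":
    demo = ["Python is dynamic. It is popular.", "Lists are mutable. Tuples are not."]
    print(map_reduce_summarize(demo, first_sentence))
```

Passing the summarizer in as a callable keeps the pipeline logic testable without loading the model.
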
## Files
### 1. `schemas.py`
- **Purpose:** Single source of truth for all Pydantic data models.
- **Key Classes:**
  - `SummarySchema`: Full structured output (title, detected_language, summary, segments, conclusion, topics).
  - `SegmentSchema`: A timeline section (title, summary, key_insight, why_it_matters).
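
As a rough sketch, the two models might look like the following. The field names come from the list above, but the field types (plain strings and lists) and the `is_valid_summary` helper are assumptions for illustration; the real `schemas.py` may add constraints or defaults.

```python
from typing import List

from pydantic import BaseModel, ValidationError


class SegmentSchema(BaseModel):
    """One timeline section. Field types are assumed to be plain strings."""
    title: str
    summary: str
    key_insight: str
    why_it_matters: str


class SummarySchema(BaseModel):
    """Full structured output. Field types are assumptions."""
    title: str
    detected_language: str
    summary: str
    segments: List[SegmentSchema]
    conclusion: str
    topics: List[str]


def is_valid_summary(data: dict) -> bool:
    """Hypothetical helper: True if the dict satisfies SummarySchema."""
    try:
        SummarySchema(**data)
        return True
    except ValidationError:
        return False
```
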
### 2. `note_generator.py`
- **Purpose:** Generate notes using a local mT5 model with chunk-based map-reduce and schema validation.
- **Main Class:** `NoteGenerator`
- **Key Methods:**
  - `generateSummary(transcript, title)`: Generates structured JSON study notes.
  - `format_notes_to_markdown(json_notes)`: Converts JSON to clean Markdown.
  - `format_final_notes(notes, title, url, duration)`: Wraps Markdown with Source/Duration header.
### 3. `segmenter.py`
- **Purpose:** Split long texts into smaller segments for preprocessing.
- **Main Class:** `TranscriptSegmenter`
- **Key Methods:**
  - `segment_text_by_words()`: Split text into fixed-size word chunks (used by the mT5 pipeline).
  - `segment_by_time()`: Split by time intervals.
  - `segment_by_topic()`: Split by paragraph/topic boundaries.
  - `clean_text()`: Remove filler words.
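
Fixed-size word chunking can be illustrated with a short standalone function. The signature and the default `chunk_size` of 400 are assumptions; the actual `TranscriptSegmenter.segment_text_by_words()` method may take different parameters.

```python
from typing import List


def segment_text_by_words(text: str, chunk_size: int = 400) -> List[str]:
    """Split text into consecutive chunks of at most chunk_size words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Splitting on whitespace also collapses repeated spaces and newlines.
    words = text.split()
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]
```

The last chunk may be shorter than `chunk_size`, and an empty input yields an empty list.
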
## JSON Output Structure
```json
{
  "title": "...",
  "detected_language": "English",
  "summary": "Overall summary (3-5 sentences)",
  "segments": [
    {
      "title": "Segment title",
      "summary": "What this section covers",
      "key_insight": "Most important point",
      "why_it_matters": "Why this is valuable"
    }
  ],
  "conclusion": "Final takeaway",
  "topics": ["Topic1", "Topic2"]
}
```
> **Note:** `topics` is hidden metadata. It is not rendered in Markdown and is used by downstream modules only.
## Markdown Output Order
1. **Source**: video URL
2. **Duration**: video length
3. **Overall Summary**: one concise summary
4. **Timeline**: chronological segments (3-7), each with Key Insight + Why It Matters
5. **Conclusion**: final takeaway
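
The ordering above could be produced by a renderer along these lines. This is only a sketch: `render_notes` is a hypothetical function, and the exact heading levels and bullet layout are assumptions, not the output of `format_notes_to_markdown` or `format_final_notes`.

```python
from typing import Any, Dict


def render_notes(notes: Dict[str, Any], url: str, duration: str) -> str:
    """Render notes in the order: Source, Duration, Summary, Timeline, Conclusion."""
    lines = [
        f"**Source:** {url}",
        f"**Duration:** {duration}",
        "",
        "## Overall Summary",
        notes["summary"],
        "",
        "## Timeline",
    ]
    # Segments are emitted in the order given, which is assumed chronological.
    for index, segment in enumerate(notes["segments"], start=1):
        lines += [
            f"### {index}. {segment['title']}",
            segment["summary"],
            f"- **Key Insight:** {segment['key_insight']}",
            f"- **Why It Matters:** {segment['why_it_matters']}",
        ]
    lines += ["", "## Conclusion", notes["conclusion"]]
    # "topics" is deliberately skipped: it is metadata, not rendered output.
    return "\n".join(lines)
```
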
## Labels (Localized)
| Key | English | Arabic |
|-----|---------|--------|
| source | Source | المصدر |
| duration | Duration | المدة |
| summary | Overall Summary | الملخص العام |
| timeline | Timeline | التسلسل الزمني |
| insight | Key Insight | أهم نقطة |
| why | Why It Matters | لماذا يهم؟ |
| conclusion | Conclusion | الخلاصة |
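
One way to hold these labels in code is a nested dictionary keyed by language. The `LABELS` layout, the `get_label` helper, the use of `"English"`/`"Arabic"` as language keys, and the fallback to English for other languages are all assumptions made for this sketch.

```python
LABELS = {
    "English": {
        "source": "Source",
        "duration": "Duration",
        "summary": "Overall Summary",
        "timeline": "Timeline",
        "insight": "Key Insight",
        "why": "Why It Matters",
        "conclusion": "Conclusion",
    },
    "Arabic": {
        "source": "المصدر",
        "duration": "المدة",
        "summary": "الملخص العام",
        "timeline": "التسلسل الزمني",
        "insight": "أهم نقطة",
        "why": "لماذا يهم؟",
        "conclusion": "الخلاصة",
    },
}


def get_label(key: str, language: str = "English") -> str:
    """Return the localized label, falling back to English (assumed behavior)."""
    table = LABELS.get(language, LABELS["English"])
    return table[key]
```
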
## Testing
```python
from src.summarization.note_generator import NoteGenerator

generator = NoteGenerator()

# Sample input
transcript = "Here is the complete video transcript..."
title = "Introduction to Python"

# Generate structured JSON notes, then render them as Markdown
summary_json = generator.generateSummary(transcript, title)
notes_md = generator.format_notes_to_markdown(summary_json)
print(notes_md)
```
## Libraries Used
- `transformers`: Load and run the local mT5 model (HuggingFace).
- `sentencepiece`: Tokenizer backend required by mT5.
- `langdetect`: Automatic language detection for multilingual support.
- `torch`: PyTorch runtime for model inference.
- `pydantic`: Data validation and schema enforcement.
## Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `MT5_MODEL_NAME` | `google/mt5-small` | HuggingFace model ID to load |
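
Reading this variable with its default might look like the sketch below. `resolve_model_name` is a hypothetical helper, and treating an empty value as unset is an assumption; the module may read the variable differently.

```python
import os
from typing import Mapping, Optional

DEFAULT_MODEL_NAME = "google/mt5-small"


def resolve_model_name(env: Optional[Mapping[str, str]] = None) -> str:
    """Return MT5_MODEL_NAME from the environment, or the documented default."""
    env = os.environ if env is None else env
    # Assumption: an empty or whitespace-only value counts as "not set".
    value = env.get("MT5_MODEL_NAME", "").strip()
    return value or DEFAULT_MODEL_NAME
```

For example, setting `MT5_MODEL_NAME=google/mt5-base` before starting the process would make this helper return `google/mt5-base`.
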