# Summarization Module 📝

## Responsibility
This module handles **text summarization and conversion to structured study notes**.

## Functionality
1. Receive transcribed text from videos.
2. Use **Groq (Llama-3.3-70b-versatile)** to analyze text and generate structured JSON notes.
3. Produce clean Markdown output with:
   - Source & Duration header
   - Overall Summary
   - Chronological Timeline (3-7 segments with Key Insight + Why It Matters)
   - Conclusion

## Files

### 1. `schemas.py`
- **Purpose:** Single source of truth for all Pydantic data models.
- **Key Classes:**
  - `SummarySchema`: Full structured output (title, detected_language, summary, segments, conclusion, topics).
  - `SegmentSchema`: A timeline section (title, summary, key_insight, why_it_matters).

### 2. `note_generator.py`
- **Purpose:** Generate notes using Groq AI with strict JSON enforcement.
- **Main Class:** `NoteGenerator`
- **Key Methods:**
  - `generateSummary(transcript, title)`: Generates structured JSON study notes.
  - `format_notes_to_markdown(json_notes)`: Converts JSON to clean Markdown.
  - `format_final_notes(notes, title, url, duration)`: Wraps Markdown with Source/Duration header.

### 3. `segmenter.py`
- **Purpose:** Split long texts into smaller segments for preprocessing.
- **Main Class:** `TranscriptSegmenter`
- **Key Methods:**
  - `segment_by_time()`: Split by time intervals.
  - `clean_text()`: Remove filler words.

## JSON Output Structure
```json
{
  "title": "...",
  "detected_language": "English",
  "summary": "Overall summary (3-5 sentences)",
  "segments": [
    {
      "title": "Segment title",
      "summary": "What this section covers",
      "key_insight": "Most important point",
      "why_it_matters": "Why this is valuable"
    }
  ],
  "conclusion": "Final takeaway",
  "topics": ["Topic1", "Topic2"]
}
```

> **Note:** `topics` is hidden metadata: not rendered in Markdown, used by downstream modules only.

## Markdown Output Order
1. **Source**: video URL
2. **Duration**: video length
3. **Overall Summary**: one concise summary
4. **Timeline**: chronological segments (3-7), each with Key Insight + Why It Matters
5. **Conclusion**: final takeaway

## Labels (Localized)
| Key | English | Arabic |
|-----|---------|--------|
| source | Source | المصدر |
| duration | Duration | المدة |
| summary | Overall Summary | الملخص العام |
| timeline | Timeline | التسلسل الزمني |
| insight | Key Insight | أهم نقطة |
| why | Why It Matters | لماذا يهم؟ |
| conclusion | Conclusion | الخلاصة |

## Testing
```python
from src.summarization.note_generator import NoteGenerator

generator = NoteGenerator()

transcript = "Here is the complete video transcript..."
title = "Introduction to Python"

# Generate notes
summary_json = generator.generateSummary(transcript, title)
notes_md = generator.format_notes_to_markdown(summary_json)
print(notes_md)
```

## Libraries Used
- `groq`: Communicate with Groq API (Llama-3.3-70b-versatile).
- `pydantic`: Data validation and schema enforcement.
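
## Rendering Sketch
To make the Markdown output order and the localized labels concrete, here is a minimal, standard-library-only sketch of how a notes dictionary shaped like the JSON structure in this README could be turned into Markdown. It is not the module's actual `format_notes_to_markdown` or `format_final_notes` code: the names `LABELS` and `render_notes`, the heading levels, and the fallback to English labels are all assumptions made for illustration.

```python
# Hypothetical sketch: LABELS and render_notes are illustrative names,
# not the real implementation in note_generator.py.
LABELS = {
    "English": {
        "source": "Source",
        "duration": "Duration",
        "summary": "Overall Summary",
        "timeline": "Timeline",
        "insight": "Key Insight",
        "why": "Why It Matters",
        "conclusion": "Conclusion",
    },
    "Arabic": {
        "source": "المصدر",
        "duration": "المدة",
        "summary": "الملخص العام",
        "timeline": "التسلسل الزمني",
        "insight": "أهم نقطة",
        "why": "لماذا يهم؟",
        "conclusion": "الخلاصة",
    },
}


def render_notes(notes: dict, url: str, duration: str) -> str:
    """Render a notes dict (shaped like the JSON structure) as Markdown."""
    # Assumption: any language without a label set falls back to English.
    language = notes.get("detected_language", "English")
    labels = LABELS.get(language, LABELS["English"])

    # Header first, then the overall summary, then the timeline heading.
    lines = [
        f"**{labels['source']}:** {url}",
        f"**{labels['duration']}:** {duration}",
        "",
        f"## {labels['summary']}",
        notes["summary"],
        "",
        f"## {labels['timeline']}",
    ]

    # Segments are rendered in the order given, i.e. chronologically.
    for number, segment in enumerate(notes["segments"], start=1):
        lines += [
            f"### {number}. {segment['title']}",
            segment["summary"],
            f"- **{labels['insight']}:** {segment['key_insight']}",
            f"- **{labels['why']}:** {segment['why_it_matters']}",
            "",
        ]

    lines += [f"## {labels['conclusion']}", notes["conclusion"]]
    # `topics` is intentionally skipped: it is metadata for downstream modules.
    return "\n".join(lines)


# Sample input with a single segment for brevity (real notes carry 3-7).
sample = {
    "title": "Introduction to Python",
    "detected_language": "English",
    "summary": "Overall summary (3-5 sentences)",
    "segments": [
        {
            "title": "Segment title",
            "summary": "What this section covers",
            "key_insight": "Most important point",
            "why_it_matters": "Why this is valuable",
        }
    ],
    "conclusion": "Final takeaway",
    "topics": ["Topic1", "Topic2"],
}

markdown = render_notes(sample, "https://example.com/video", "12:34")
arabic_markdown = render_notes(
    {**sample, "detected_language": "Arabic"}, "https://example.com/video", "12:34"
)
print(markdown)
```

Keeping the labels in a plain lookup table means a new language only needs a new dictionary entry, and the rendering order stays fixed in one place.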