| # Summarization Module | |
| ## Responsibility | |
| This module handles **text summarization and conversion to structured study notes**. | |
| ## Functionality | |
| 1. Receive transcribed text from videos. | |
| 2. Use a **local mT5 model** (map-reduce pipeline) to analyze text and generate structured JSON notes. | |
| 3. Produce clean Markdown output with: | |
| - Source & Duration header | |
| - Overall Summary | |
| - Chronological Timeline (3-7 segments with Key Insight + Why It Matters) | |
| - Conclusion | |
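The map-reduce flow from step 2 can be sketched as below. This is a minimal illustration under stated assumptions, not the module's implementation: `first_sentence` is a hypothetical stand-in for the mT5 call, and `map_reduce_summary` is a name chosen here for the example.

```python
from typing import Callable, List


def first_sentence(text: str) -> str:
    """Hypothetical stand-in for the mT5 model: keep only the first sentence."""
    head = text.split(". ")[0].strip()
    return head if head.endswith(".") else head + "."


def map_reduce_summary(
    chunks: List[str],
    summarize_chunk: Callable[[str], str] = first_sentence,
) -> str:
    """Summarize each chunk (map), then summarize the joined partials (reduce)."""
    # Map step: one partial summary per non-empty chunk.
    partials = [summarize_chunk(chunk) for chunk in chunks if chunk.strip()]
    if not partials:
        return ""
    if len(partials) == 1:
        return partials[0]
    # Reduce step: condense the concatenated partial summaries once more.
    return summarize_chunk(" ".join(partials))


chunks = [
    "Python is a language. It is popular.",
    "Lists hold items. They are mutable.",
]
print(map_reduce_summary(chunks))  # prints: Python is a language.
```

In a real pipeline the `summarize_chunk` callable would wrap the model's generate call, so the chunking and reduction logic can be tested without loading any weights.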
| ## Files | |
| ### 1. `schemas.py` | |
| - **Purpose:** Single source of truth for all Pydantic data models. | |
| - **Key Classes:** | |
| - `SummarySchema` - Full structured output (title, detected_language, summary, segments, conclusion, topics). | |
| - `SegmentSchema` - A timeline section (title, summary, key_insight, why_it_matters). | |
| ### 2. `note_generator.py` | |
| - **Purpose:** Generate notes using a local mT5 model with chunk-based map-reduce and schema validation. | |
| - **Main Class:** `NoteGenerator` | |
| - **Key Methods:** | |
| - `generateSummary(transcript, title)` - Generates structured JSON study notes. | |
| - `format_notes_to_markdown(json_notes)` - Converts JSON to clean Markdown. | |
| - `format_final_notes(notes, title, url, duration)` - Wraps Markdown with Source/Duration header. | |
| ### 3. `segmenter.py` | |
| - **Purpose:** Split long texts into smaller segments for preprocessing. | |
| - **Main Class:** `TranscriptSegmenter` | |
| - **Key Methods:** | |
| - `segment_text_by_words()` - Split text into fixed-size word chunks (used by the mT5 pipeline). | |
| - `segment_by_time()` - Split by time intervals. | |
| - `segment_by_topic()` - Split by paragraph/topic boundaries. | |
| - `clean_text()` - Remove filler words. | |
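Fixed-size word chunking, as described for `segment_text_by_words()`, might look like the sketch below. The function name `chunk_by_words`, its signature, and the default of 300 words are assumptions for illustration; the actual method on `TranscriptSegmenter` may differ.

```python
from typing import List


def chunk_by_words(text: str, chunk_size: int = 300) -> List[str]:
    """Split text into consecutive chunks of at most chunk_size words.

    Hypothetical helper: whitespace is used as the word boundary, and the
    final chunk may be shorter than chunk_size.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]


print(chunk_by_words("a b c d e", chunk_size=2))  # prints: ['a b', 'c d', 'e']
```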
| ## JSON Output Structure | |
| ```json | |
| { | |
| "title": "...", | |
| "detected_language": "English", | |
| "summary": "Overall summary (3-5 sentences)", | |
| "segments": [ | |
| { | |
| "title": "Segment title", | |
| "summary": "What this section covers", | |
| "key_insight": "Most important point", | |
| "why_it_matters": "Why this is valuable" | |
| } | |
| ], | |
| "conclusion": "Final takeaway", | |
| "topics": ["Topic1", "Topic2"] | |
| } | |
| ``` | |
| > **Note:** `topics` is hidden metadata: not rendered in Markdown, used by downstream modules only. | |
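A quick structural check against the JSON shape above can be written with the standard library alone. This sketch is a stand-in for the Pydantic validation in `schemas.py`, not a copy of it; the helper name `find_missing_fields` is hypothetical, and it only checks that keys are present, not their types.

```python
from typing import Any, Dict, List

# Field names taken from the JSON structure documented above.
REQUIRED_TOP_LEVEL = (
    "title", "detected_language", "summary", "segments", "conclusion", "topics",
)
REQUIRED_SEGMENT = ("title", "summary", "key_insight", "why_it_matters")


def find_missing_fields(notes: Dict[str, Any]) -> List[str]:
    """Return the paths of required fields that are absent from a notes dict."""
    missing = [key for key in REQUIRED_TOP_LEVEL if key not in notes]
    for index, segment in enumerate(notes.get("segments", [])):
        missing += [
            f"segments[{index}].{key}"
            for key in REQUIRED_SEGMENT
            if key not in segment
        ]
    return missing


incomplete = {
    "title": "Introduction to Python",
    "detected_language": "English",
    "summary": "...",
    "segments": [{"title": "Basics", "summary": "...", "why_it_matters": "..."}],
    "topics": [],
}
print(find_missing_fields(incomplete))
# prints: ['conclusion', 'segments[0].key_insight']
```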
| ## Markdown Output Order | |
| 1. **Source** - video URL | |
| 2. **Duration** - video length | |
| 3. **Overall Summary** - one concise summary | |
| 4. **Timeline** - chronological segments (3-7), each with Key Insight + Why It Matters | |
| 5. **Conclusion** - final takeaway | |
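The ordering above can be illustrated with a small renderer. This is a sketch under assumptions: the heading levels, the bullet layout, and the function name `render_notes` are invented for the example and are not taken from `format_notes_to_markdown` or `format_final_notes`.

```python
from typing import Any, Dict


def render_notes(notes: Dict[str, Any], url: str, duration: str) -> str:
    """Render a notes dict as Markdown in the documented section order."""
    lines = [
        f"**Source:** {url}",
        f"**Duration:** {duration}",
        "",
        "## Overall Summary",
        notes["summary"],
        "",
        "## Timeline",
    ]
    # Segments are emitted in list order, which is assumed to be chronological.
    for number, segment in enumerate(notes["segments"], start=1):
        lines += [
            f"### {number}. {segment['title']}",
            segment["summary"],
            f"- **Key Insight:** {segment['key_insight']}",
            f"- **Why It Matters:** {segment['why_it_matters']}",
            "",
        ]
    # `topics` is intentionally skipped: it is metadata for downstream modules.
    lines += ["## Conclusion", notes["conclusion"]]
    return "\n".join(lines)


example_notes = {
    "title": "Introduction to Python",
    "detected_language": "English",
    "summary": "A short tour of the basics.",
    "segments": [
        {
            "title": "Variables",
            "summary": "How names bind to values.",
            "key_insight": "Names are references.",
            "why_it_matters": "It explains aliasing bugs.",
        }
    ],
    "conclusion": "Practice with small scripts.",
    "topics": ["Topic1", "Topic2"],
}
print(render_notes(example_notes, "https://example.com/video", "12:34"))
```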
| ## Labels (Localized) | |
| | Key | English | Arabic | | |
| |-----|---------|--------| | |
| | source | Source | المصدر | |
| | duration | Duration | المدة | |
| | summary | Overall Summary | الملخص العام | |
| | timeline | Timeline | التسلسل الزمني | |
| | insight | Key Insight | أهم نقطة | |
| | why | Why It Matters | لماذا يهم؟ | |
| | conclusion | Conclusion | الخلاصة | |
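One way to hold these labels in code is a nested dictionary keyed by language name. The `LABELS` layout, the `get_label` helper, and the English fallback are assumptions for illustration; the language keys follow the `detected_language` value shown in the JSON example.

```python
from typing import Dict

# Label table mirroring the rows above; structure is a hypothetical sketch.
LABELS: Dict[str, Dict[str, str]] = {
    "English": {
        "source": "Source",
        "duration": "Duration",
        "summary": "Overall Summary",
        "timeline": "Timeline",
        "insight": "Key Insight",
        "why": "Why It Matters",
        "conclusion": "Conclusion",
    },
    "Arabic": {
        "source": "المصدر",
        "duration": "المدة",
        "summary": "الملخص العام",
        "timeline": "التسلسل الزمني",
        "insight": "أهم نقطة",
        "why": "لماذا يهم؟",
        "conclusion": "الخلاصة",
    },
}


def get_label(key: str, language: str = "English") -> str:
    """Return the label for key, falling back to English for unknown languages."""
    table = LABELS.get(language, LABELS["English"])
    return table[key]


print(get_label("timeline", "Arabic"))
print(get_label("why", "French"))  # unknown language falls back to English
```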
| ## Testing | |
| ```python | |
| from src.summarization.note_generator import NoteGenerator | |
| generator = NoteGenerator() | |
| transcript = "Here is the complete video transcript..." | |
| title = "Introduction to Python" | |
| # Generate notes | |
| summary_json = generator.generateSummary(transcript, title) | |
| notes_md = generator.format_notes_to_markdown(summary_json) | |
| print(notes_md) | |
| ``` | |
| ## Libraries Used | |
| - `transformers` - Load and run the local mT5 model (HuggingFace). | |
| - `sentencepiece` - Tokenizer backend required by mT5. | |
| - `langdetect` - Automatic language detection for multilingual support. | |
| - `torch` - PyTorch runtime for model inference. | |
| - `pydantic` - Data validation and schema enforcement. | |
| ## Environment Variables | |
| | Variable | Default | Description | | |
| |----------|---------|-------------| | |
| | `MT5_MODEL_NAME` | `google/mt5-small` | HuggingFace model ID to load | | |
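Reading the variable with its documented default could look like this. The helper name `resolve_model_name` is hypothetical, and treating an empty value as unset is an assumption; the commented lines show how the result would typically be passed to `transformers`.

```python
import os
from typing import Mapping, Optional

DEFAULT_MODEL_NAME = "google/mt5-small"


def resolve_model_name(env: Optional[Mapping[str, str]] = None) -> str:
    """Return MT5_MODEL_NAME from the environment, or the documented default."""
    env = os.environ if env is None else env
    # An empty string is treated the same as an unset variable (assumption).
    return env.get("MT5_MODEL_NAME") or DEFAULT_MODEL_NAME


model_name = resolve_model_name()
print(model_name)

# Typical usage with transformers (not executed here to avoid a model download):
# from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained(model_name)
# model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
```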