# Summarization Module 📝

## Responsibility
This module handles **text summarization and conversion to structured study notes**.

## Functionality
1. Receive transcribed text from videos.
2. Use a **local mT5 model** in a chunk-based map-reduce pipeline: summarize each chunk, then merge the chunk summaries into structured JSON notes.
3. Produce clean Markdown output with:
   - Source & Duration header
   - Overall Summary
   - Chronological Timeline (3-7 segments with Key Insight + Why It Matters)
   - Conclusion
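
The map-reduce step in item 2 can be sketched as follows. This is a minimal illustration, not the module's actual code: `first_sentence` is a hypothetical stand-in for the mT5 call (it only keeps the first sentence), and the default chunk size of 50 words is an assumed value.

```python
from typing import Callable, List


def split_into_word_chunks(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


def first_sentence(text: str) -> str:
    """Hypothetical stand-in for the model: keep only the first sentence."""
    head = text.split(". ")[0].strip()
    return head if head.endswith(".") else head + "."


def map_reduce_summarize(
    transcript: str,
    summarize_chunk: Callable[[str], str] = first_sentence,
    chunk_size: int = 50,
) -> str:
    """Summarize each chunk (map), then summarize the joined partials (reduce)."""
    chunks = split_into_word_chunks(transcript, chunk_size)
    if not chunks:
        return ""
    partials = [summarize_chunk(chunk) for chunk in chunks]  # map step
    if len(partials) == 1:
        return partials[0]
    return summarize_chunk(" ".join(partials))  # reduce step
```

Passing the summarizer as a callable keeps the chunking logic testable without loading a model; in the real pipeline that callable would wrap the mT5 generation call.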

## Files

### 1. `schemas.py`
- **Purpose:** Single source of truth for all Pydantic data models.
- **Key Classes:**
  - `SummarySchema` — Full structured output (title, detected_language, summary, segments, conclusion, topics).
  - `SegmentSchema` — A timeline section (title, summary, key_insight, why_it_matters).

### 2. `note_generator.py`
- **Purpose:** Generate notes using a local mT5 model with chunk-based map-reduce and schema validation.
- **Main Class:** `NoteGenerator`
- **Key Methods:**
  - `generateSummary(transcript, title)` — Generates structured JSON study notes.
  - `format_notes_to_markdown(json_notes)` — Converts JSON to clean Markdown.
  - `format_final_notes(notes, title, url, duration)` — Wraps Markdown with Source/Duration header.

### 3. `segmenter.py`
- **Purpose:** Split long texts into smaller segments for preprocessing.
- **Main Class:** `TranscriptSegmenter`
- **Key Methods:**
  - `segment_text_by_words()` — Split text into fixed-size word chunks (used by the mT5 pipeline).
  - `segment_by_time()` — Split by time intervals.
  - `segment_by_topic()` — Split by paragraph/topic boundaries.
  - `clean_text()` — Remove filler words.
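
As a rough sketch of what filler removal and paragraph-based splitting might look like: the function names `remove_fillers` and `split_on_paragraphs` are hypothetical, and the filler list is an assumption for illustration only. The real `TranscriptSegmenter` methods may use different rules.

```python
import re
from typing import List

# Assumed filler list for illustration; the module's real list is not shown here.
FILLER_PATTERN = re.compile(r"\b(?:um+|uh+|you know)\b,?\s*", re.IGNORECASE)


def remove_fillers(text: str) -> str:
    """Drop filler words, then collapse any leftover runs of whitespace."""
    without_fillers = FILLER_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", without_fillers).strip()


def split_on_paragraphs(text: str) -> List[str]:
    """Split on blank lines and discard empty pieces."""
    pieces = re.split(r"\n\s*\n", text)
    return [piece.strip() for piece in pieces if piece.strip()]
```

The word boundaries in the pattern keep words such as "umbrella" intact while still catching a standalone "um".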

## JSON Output Structure
```json
{
    "title": "...",
    "detected_language": "English",
    "summary": "Overall summary (3-5 sentences)",
    "segments": [
        {
            "title": "Segment title",
            "summary": "What this section covers",
            "key_insight": "Most important point",
            "why_it_matters": "Why this is valuable"
        }
    ],
    "conclusion": "Final takeaway",
    "topics": ["Topic1", "Topic2"]
}
```

> **Note:** `topics` is hidden metadata: it is not rendered in the Markdown output and is used only by downstream modules.
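
For a quick sanity check of this structure without any dependencies, something like the following could be used. It is a plain-Python stand-in, not the Pydantic models in `schemas.py`; the helper name `find_structure_errors` is hypothetical, and enforcing the 3-7 segment range here is an assumption based on the timeline description.

```python
from typing import Any, Dict, List

REQUIRED_TOP_LEVEL = (
    "title", "detected_language", "summary", "segments", "conclusion", "topics",
)
REQUIRED_SEGMENT = ("title", "summary", "key_insight", "why_it_matters")


def find_structure_errors(notes: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means the shape looks right."""
    errors = [f"missing field: {key}" for key in REQUIRED_TOP_LEVEL if key not in notes]
    segments = notes.get("segments", [])
    if not isinstance(segments, list):
        return errors + ["segments must be a list"]
    # Assumed constraint: the timeline is described as having 3-7 segments.
    if not 3 <= len(segments) <= 7:
        errors.append(f"expected 3-7 segments, got {len(segments)}")
    for index, segment in enumerate(segments):
        if not isinstance(segment, dict):
            errors.append(f"segments[{index}] must be an object")
            continue
        for key in REQUIRED_SEGMENT:
            if key not in segment:
                errors.append(f"segments[{index}] missing field: {key}")
    return errors
```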

## Markdown Output Order
1. **Source** — video URL
2. **Duration** — video length
3. **Overall Summary** — one concise summary
4. **Timeline** — chronological segments (3-7), each with Key Insight + Why It Matters
5. **Conclusion** — final takeaway

## Labels (Localized)
| Key | English | Arabic |
|-----|---------|--------|
| source | Source | المصدر |
| duration | Duration | المدة |
| summary | Overall Summary | الملخص العام |
| timeline | Timeline | التسلسل الزمني |
| insight | Key Insight | أهم نقطة |
| why | Why It Matters | لماذا يهم؟ |
| conclusion | Conclusion | الخلاصة |
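
To show how the output order and the localized labels fit together, here is a minimal rendering sketch. The function name `render_notes`, the `"en"`/`"ar"` language codes, and the exact heading levels are assumptions; only the label strings and the section order come from this README.

```python
from typing import Any, Dict

LABELS: Dict[str, Dict[str, str]] = {
    "en": {
        "source": "Source", "duration": "Duration", "summary": "Overall Summary",
        "timeline": "Timeline", "insight": "Key Insight",
        "why": "Why It Matters", "conclusion": "Conclusion",
    },
    "ar": {
        "source": "المصدر", "duration": "المدة", "summary": "الملخص العام",
        "timeline": "التسلسل الزمني", "insight": "أهم نقطة",
        "why": "لماذا يهم؟", "conclusion": "الخلاصة",
    },
}


def render_notes(notes: Dict[str, Any], url: str, duration: str, lang: str = "en") -> str:
    """Render notes in the documented order; `topics` is deliberately not rendered."""
    labels = LABELS.get(lang, LABELS["en"])  # fall back to English labels
    lines = [
        f"**{labels['source']}:** {url}",
        f"**{labels['duration']}:** {duration}",
        "",
        f"## {labels['summary']}",
        notes["summary"],
        "",
        f"## {labels['timeline']}",
    ]
    for index, segment in enumerate(notes["segments"], start=1):
        lines += [
            f"### {index}. {segment['title']}",
            segment["summary"],
            f"- **{labels['insight']}:** {segment['key_insight']}",
            f"- **{labels['why']}:** {segment['why_it_matters']}",
            "",
        ]
    lines += [f"## {labels['conclusion']}", notes["conclusion"]]
    return "\n".join(lines)
```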

## Testing
```python
from src.summarization.note_generator import NoteGenerator

generator = NoteGenerator()
transcript = "Here is the complete video transcript..."
title = "Introduction to Python"

# Generate structured JSON notes, then convert them to Markdown
summary_json = generator.generateSummary(transcript, title)
notes_md = generator.format_notes_to_markdown(summary_json)

print(notes_md)
```

## Libraries Used
- `transformers` — Load and run the local mT5 model (HuggingFace).
- `sentencepiece` — Tokenizer backend required by mT5.
- `langdetect` — Automatic language detection for multilingual support.
- `torch` — PyTorch runtime for model inference.
- `pydantic` — Data validation and schema enforcement.

## Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `MT5_MODEL_NAME` | `google/mt5-small` | HuggingFace model ID to load |
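
A minimal sketch of how the variable might be resolved at startup. The helper name `resolve_model_name` is hypothetical, and treating an empty value as unset is an assumption rather than documented behavior.

```python
import os

DEFAULT_MODEL_NAME = "google/mt5-small"  # default from the table above


def resolve_model_name() -> str:
    """Return MT5_MODEL_NAME if it is set and non-empty, else the default."""
    return os.environ.get("MT5_MODEL_NAME") or DEFAULT_MODEL_NAME
```

The resolved name would then be handed to whatever loads the model, for example `AutoModelForSeq2SeqLM.from_pretrained` in `transformers`.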