	# Summarization Module 📝

	## Responsibility
	This module handles text summarization and conversion to structured study notes.

	## Functionality
	1. Receive transcribed text from videos.
	2. Use a local mT5 model (map-reduce pipeline) to analyze text and generate structured JSON notes.
	3. Produce clean Markdown output with:
	   - Source & Duration header
	   - Overall Summary
	   - Chronological Timeline (3-7 segments, each with Key Insight + Why It Matters)
	   - Conclusion
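
The map-reduce step in item 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the module's actual code: `map_reduce_summarize` and `first_sentence` are hypothetical names, and `first_sentence` is a toy stand-in for the mT5 call so the example runs without a model.

```python
from typing import Callable, List


def map_reduce_summarize(chunks: List[str], summarize: Callable[[str], str]) -> str:
    """Summarize each chunk (map), then summarize the joined partials (reduce)."""
    if not chunks:
        return ""
    # Map step: one partial summary per chunk.
    partials = [summarize(chunk) for chunk in chunks]
    if len(partials) == 1:
        return partials[0]
    # Reduce step: condense the concatenated partial summaries into one.
    return summarize(" ".join(partials))


def first_sentence(text: str) -> str:
    """Toy stand-in for the model call (assumption): keep only the first sentence."""
    head = text.split(".")[0].strip()
    return head + "." if head else ""


chunks = [
    "Python is dynamically typed. It has many libraries.",
    "Lists are mutable. Tuples are not.",
]
overall = map_reduce_summarize(chunks, first_sentence)
print(overall)
```

In a real pipeline the `summarize` callable would wrap the model's generation call; passing it in as a parameter keeps the chunking logic testable without loading weights.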

	## Files

	### 1. `schemas.py`
	- Purpose: Single source of truth for all Pydantic data models.
	- Key Classes:
	  - `SummarySchema`: Full structured output (title, detected_language, summary, segments, conclusion, topics).
	  - `SegmentSchema`: A timeline section (title, summary, key_insight, why_it_matters).
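
For orientation, the two models might look roughly like the sketch below. The field names come from the list above; the field types, and the choice to let `topics` default to an empty list, are assumptions rather than the contents of `schemas.py`.

```python
from typing import List

from pydantic import BaseModel, Field, ValidationError


class SegmentSchema(BaseModel):
    title: str
    summary: str
    key_insight: str
    why_it_matters: str


class SummarySchema(BaseModel):
    title: str
    detected_language: str
    summary: str
    segments: List[SegmentSchema]
    conclusion: str
    # Assumption: topics is optional and defaults to an empty list.
    topics: List[str] = Field(default_factory=list)


raw = {
    "title": "Introduction to Python",
    "detected_language": "English",
    "summary": "A short tour of the basics.",
    "segments": [
        {
            "title": "Variables",
            "summary": "How names bind to values.",
            "key_insight": "Names are references.",
            "why_it_matters": "It explains aliasing bugs.",
        }
    ],
    "conclusion": "Practice with small scripts.",
}
notes = SummarySchema(**raw)  # nested dicts are validated as SegmentSchema

# Model output that omits required fields is rejected.
try:
    SummarySchema(**{"title": "No summary"})
    missing_fields_rejected = False
except ValidationError:
    missing_fields_rejected = True
```

Validating at this boundary means malformed model output fails early instead of surfacing later in the Markdown formatter.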

	### 2. `note_generator.py`
	- Purpose: Generate notes using a local mT5 model with chunk-based map-reduce and schema validation.
	- Main Class: `NoteGenerator`
	- Key Methods:
	  - `generateSummary(transcript, title)`: Generates structured JSON study notes.
	  - `format_notes_to_markdown(json_notes)`: Converts JSON to clean Markdown.
	  - `format_final_notes(notes, title, url, duration)`: Wraps Markdown with Source/Duration header.

	### 3. `segmenter.py`
	- Purpose: Split long texts into smaller segments for preprocessing.
	- Main Class: `TranscriptSegmenter`
	- Key Methods:
	  - `segment_text_by_words()`: Split text into fixed-size word chunks (used by the mT5 pipeline).
	  - `segment_by_time()`: Split by time intervals.
	  - `segment_by_topic()`: Split by paragraph/topic boundaries.
	  - `clean_text()`: Remove filler words.
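
Fixed-size word chunking and filler-word removal could be sketched as below. This is illustrative only: the real methods live on `TranscriptSegmenter` and their signatures are not shown here, so the free functions, the `chunk_size` parameter, and the `FILLER_WORDS` set are all assumptions.

```python
from typing import List

# Hypothetical filler list; the module's actual list is not documented here.
FILLER_WORDS = {"um", "uh", "erm"}


def clean_text(text: str) -> str:
    """Drop filler words (ignoring case and trailing punctuation) and collapse whitespace."""
    kept = [w for w in text.split() if w.lower().strip(",.!?") not in FILLER_WORDS]
    return " ".join(kept)


def segment_text_by_words(text: str, chunk_size: int = 400) -> List[str]:
    """Split text into consecutive chunks of at most chunk_size words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


cleaned = clean_text("Um, today we uh cover loops and functions in Python")
chunks = segment_text_by_words(cleaned, chunk_size=4)
print(chunks)
```

The last chunk may be shorter than `chunk_size`; an empty transcript yields an empty list.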

	## JSON Output Structure
	```json
	{
	"title": "...",
	"detected_language": "English",
	"summary": "Overall summary (3-5 sentences)",
	"segments": [
	{
	"title": "Segment title",
	"summary": "What this section covers",
	"key_insight": "Most important point",
	"why_it_matters": "Why this is valuable"
	}
	],
	"conclusion": "Final takeaway",
	"topics": ["Topic1", "Topic2"]
	}
	```

	> Note: `topics` is hidden metadata: it is not rendered in the Markdown output and is used only by downstream modules.
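
One way to keep `topics` out of the rendered notes while still handing it to downstream code is to split the payload before formatting. The helper name `split_hidden_metadata` below is hypothetical, and this is a sketch rather than the module's behavior.

```python
import json
from typing import Any, Dict, List, Tuple


def split_hidden_metadata(notes: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str]]:
    """Return (fields to render, topics) without mutating the input dict."""
    renderable = {key: value for key, value in notes.items() if key != "topics"}
    return renderable, list(notes.get("topics", []))


payload = json.loads(
    '{"title": "Intro", "summary": "S", "segments": [], '
    '"conclusion": "C", "topics": ["Python", "Basics"]}'
)
renderable, topics = split_hidden_metadata(payload)
print(sorted(renderable), topics)
```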

	## Markdown Output Order
	1. Source — video URL
	2. Duration — video length
	3. Overall Summary — one concise summary
	4. Timeline — chronological segments (3-7), each with Key Insight + Why It Matters
	5. Conclusion — final takeaway
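
The ordering above can be illustrated with a small renderer. This is a sketch under assumptions: `render_notes` is a hypothetical function, and the exact heading levels and bold markers used by `format_notes_to_markdown` and `format_final_notes` may differ.

```python
from typing import Any, Dict, List


def render_notes(notes: Dict[str, Any], url: str, duration: str) -> str:
    """Emit Source, Duration, Overall Summary, Timeline, Conclusion, in that order."""
    lines: List[str] = [
        f"**Source:** {url}",
        f"**Duration:** {duration}",
        "",
        "## Overall Summary",
        notes["summary"],
        "",
        "## Timeline",
    ]
    for index, segment in enumerate(notes["segments"], start=1):
        lines += [
            f"### {index}. {segment['title']}",
            segment["summary"],
            f"- **Key Insight:** {segment['key_insight']}",
            f"- **Why It Matters:** {segment['why_it_matters']}",
            "",
        ]
    # topics is deliberately never read here, so it stays out of the output.
    lines += ["## Conclusion", notes["conclusion"]]
    return "\n".join(lines)


example = {
    "summary": "A short tour of Python basics.",
    "segments": [
        {
            "title": "Variables",
            "summary": "Names bound to values.",
            "key_insight": "Names are references.",
            "why_it_matters": "It explains aliasing.",
        }
    ],
    "conclusion": "Practice with small scripts.",
    "topics": ["HiddenTopic"],
}
markdown = render_notes(example, url="https://example.com/video", duration="12:34")
print(markdown)
```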

	## Labels (Localized)
	| Key | English | Arabic |
	|-----|---------|--------|
	| source | Source | المصدر |
	| duration | Duration | المدة |
	| summary | Overall Summary | الملخص العام |
	| timeline | Timeline | التسلسل الزمني |
	| insight | Key Insight | أهم نقطة |
	| why | Why It Matters | لماذا يهم؟ |
	| conclusion | Conclusion | الخلاصة |
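
A label table like this maps naturally onto a nested dictionary. In the sketch below the label strings are taken from the table, while the `LABELS` and `get_label` names, the use of `detected_language` values such as `"English"` and `"Arabic"` as keys, and the English fallback are assumptions.

```python
from typing import Dict

LABELS: Dict[str, Dict[str, str]] = {
    "English": {
        "source": "Source",
        "duration": "Duration",
        "summary": "Overall Summary",
        "timeline": "Timeline",
        "insight": "Key Insight",
        "why": "Why It Matters",
        "conclusion": "Conclusion",
    },
    "Arabic": {
        "source": "المصدر",
        "duration": "المدة",
        "summary": "الملخص العام",
        "timeline": "التسلسل الزمني",
        "insight": "أهم نقطة",
        "why": "لماذا يهم؟",
        "conclusion": "الخلاصة",
    },
}


def get_label(key: str, language: str) -> str:
    """Look up a localized label; unknown languages fall back to English (assumption)."""
    return LABELS.get(language, LABELS["English"])[key]


print(get_label("timeline", "Arabic"))
print(get_label("why", "French"))
```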

	## Testing
	```python
	from src.summarization.note_generator import NoteGenerator

	generator = NoteGenerator()
	transcript = "Here is the complete video transcript..."
	title = "Introduction to Python"

	# Generate notes
	summary_json = generator.generateSummary(transcript, title)
	notes_md = generator.format_notes_to_markdown(summary_json)

	print(notes_md)
	```

	## Libraries Used
	- `transformers` — Load and run the local mT5 model (HuggingFace).
	- `sentencepiece` — Tokenizer backend required by mT5.
	- `langdetect` — Automatic language detection for multilingual support.
	- `torch` — PyTorch runtime for model inference.
	- `pydantic` — Data validation and schema enforcement.

	## Environment Variables
	| Variable | Default | Description |
	|----------|---------|-------------|
	| `MT5_MODEL_NAME` | `google/mt5-small` | HuggingFace model ID to load |
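
Resolving the model ID from the environment might look like the sketch below. The `resolve_model_name` helper is hypothetical, and treating an empty value as unset is an assumption rather than documented behavior.

```python
import os

DEFAULT_MODEL_NAME = "google/mt5-small"


def resolve_model_name() -> str:
    """Return MT5_MODEL_NAME if set and non-empty, else the documented default."""
    return os.environ.get("MT5_MODEL_NAME", "").strip() or DEFAULT_MODEL_NAME


print(resolve_model_name())
```

The resulting string is what would be passed to the model loader, for example to switch to a larger checkpoint by exporting `MT5_MODEL_NAME` before starting the server.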