Summarization Module 📝

Responsibility

This module handles text summarization and conversion to structured study notes.

Functionality

1. Receive transcribed text from videos.
2. Use Groq (Llama-3.3-70b-versatile) to analyze text and generate structured JSON notes.
3. Produce clean Markdown output with:
   - Source & Duration header
   - Overall Summary
   - Chronological Timeline (3-7 segments with Key Insight + Why It Matters)
   - Conclusion

Files

1. `schemas.py`

Purpose: Single source of truth for all Pydantic data models.
Key Classes:
- SummarySchema — Full structured output (title, detected_language, summary, segments, conclusion, topics).
- SegmentSchema — A timeline section (title, summary, key_insight, why_it_matters).
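
As a rough illustration of the two models' shape, here is a dependency-free sketch. It uses standard-library dataclasses as a stand-in for the Pydantic models, so it is not the code in `schemas.py`. The field names come from the list above; the field types, the `from_dict` helper, and enforcing the 3-7 segment bound at load time are assumptions.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class SegmentSchema:
    """One timeline section of the notes."""
    title: str
    summary: str
    key_insight: str
    why_it_matters: str


@dataclass
class SummarySchema:
    """Full structured output for one transcript."""
    title: str
    detected_language: str
    summary: str
    segments: List[SegmentSchema]
    conclusion: str
    topics: List[str]

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "SummarySchema":
        # Hypothetical helper: build nested segments first.
        # Missing or unexpected segment keys raise TypeError here.
        segments = [SegmentSchema(**segment) for segment in data["segments"]]
        # Assumption: the 3-7 bound from the timeline description is checked here.
        if not 3 <= len(segments) <= 7:
            raise ValueError(f"expected 3-7 segments, got {len(segments)}")
        return cls(
            title=data["title"],
            detected_language=data["detected_language"],
            summary=data["summary"],
            segments=segments,
            conclusion=data["conclusion"],
            topics=list(data["topics"]),
        )
```

With real Pydantic models the same checks would normally be declared on the fields rather than written by hand.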

2. `note_generator.py`

Purpose: Generate notes using Groq AI with strict JSON enforcement.
Main Class: NoteGenerator
Key Methods:
- generateSummary(transcript, title) — Generates structured JSON study notes.
- format_notes_to_markdown(json_notes) — Converts JSON to clean Markdown.
- format_final_notes(notes, title, url, duration) — Wraps Markdown with Source/Duration header.
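
The "strict JSON enforcement" step can be pictured as a check on the model's raw reply before anything is rendered. The sketch below is one possible way to do that with the standard library only. `parse_notes` and the two key sets are hypothetical names, not part of `NoteGenerator`, and the real module may rely on Pydantic validation instead.

```python
import json

# Keys taken from the documented JSON output structure.
REQUIRED_KEYS = {"title", "detected_language", "summary", "segments", "conclusion", "topics"}
SEGMENT_KEYS = {"title", "summary", "key_insight", "why_it_matters"}


def parse_notes(raw_reply: str) -> dict:
    """Parse a model reply and reject anything that is not the expected JSON object."""
    try:
        notes = json.loads(raw_reply)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(notes, dict):
        raise ValueError("reply must be a JSON object")

    missing = REQUIRED_KEYS - notes.keys()
    if missing:
        raise ValueError(f"missing top-level keys: {sorted(missing)}")

    if not isinstance(notes["segments"], list):
        raise ValueError("'segments' must be a list")
    for index, segment in enumerate(notes["segments"]):
        if not isinstance(segment, dict):
            raise ValueError(f"segment {index} must be a JSON object")
        missing = SEGMENT_KEYS - segment.keys()
        if missing:
            raise ValueError(f"segment {index} is missing keys: {sorted(missing)}")
    return notes
```

Raising a single exception type keeps the caller's retry or fallback logic simple, whatever form that takes in the actual module.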

3. `segmenter.py`

Purpose: Split long texts into smaller segments for preprocessing.
Main Class: TranscriptSegmenter
Key Methods:
- segment_by_time() — Split by time intervals.
- clean_text() — Remove filler words.

JSON Output Structure

{
    "title": "...",
    "detected_language": "English",
    "summary": "Overall summary (3-5 sentences)",
    "segments": [
        {
            "title": "Segment title",
            "summary": "What this section covers",
            "key_insight": "Most important point",
            "why_it_matters": "Why this is valuable"
        }
    ],
    "conclusion": "Final takeaway",
    "topics": ["Topic1", "Topic2"]
}

Note: `topics` is hidden metadata. It is not rendered in the Markdown output and is used only by downstream modules.

Markdown Output Order

1. Source: video URL
2. Duration: video length
3. Overall Summary: one concise summary
4. Timeline: chronological segments (3-7), each with Key Insight + Why It Matters
5. Conclusion: final takeaway
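
The ordering above can be sketched as a single rendering function. This is an illustration only: `render_notes` is a hypothetical name, the heading levels and bold markers are assumptions about the Markdown layout, and the real work is split between `format_notes_to_markdown` and `format_final_notes`. Labels are hard-coded in English to keep the sketch short.

```python
def render_notes(notes: dict, url: str, duration: str) -> str:
    """Render a notes dict to Markdown in the documented section order."""
    lines = [
        f"**Source:** {url}",
        f"**Duration:** {duration}",
        "",
        "## Overall Summary",
        notes["summary"],
        "",
        "## Timeline",
    ]
    # Segments are emitted in the order given, which is assumed to be chronological.
    for number, segment in enumerate(notes["segments"], start=1):
        lines += [
            f"### {number}. {segment['title']}",
            segment["summary"],
            f"- **Key Insight:** {segment['key_insight']}",
            f"- **Why It Matters:** {segment['why_it_matters']}",
            "",
        ]
    # "topics" is deliberately skipped: it is metadata for downstream modules.
    lines += ["## Conclusion", notes["conclusion"]]
    return "\n".join(lines)
```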

Labels (Localized)

| Key | English | Arabic |
| --- | --- | --- |
| source | Source | المصدر |
| duration | Duration | المدة |
| summary | Overall Summary | الملخص العام |
| timeline | Timeline | التسلسل الزمني |
| insight | Key Insight | أهم نقطة |
| why | Why It Matters | لماذا يهم؟ |
| conclusion | Conclusion | الخلاصة |
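
One way to hold this table in code is a nested dictionary keyed by language. The sketch below is an assumption about the data structure, not the module's implementation. In particular, `get_labels`, selecting labels by the `detected_language` value, and falling back to English for other languages are all hypothetical.

```python
LABELS = {
    "English": {
        "source": "Source",
        "duration": "Duration",
        "summary": "Overall Summary",
        "timeline": "Timeline",
        "insight": "Key Insight",
        "why": "Why It Matters",
        "conclusion": "Conclusion",
    },
    "Arabic": {
        "source": "المصدر",
        "duration": "المدة",
        "summary": "الملخص العام",
        "timeline": "التسلسل الزمني",
        "insight": "أهم نقطة",
        "why": "لماذا يهم؟",
        "conclusion": "الخلاصة",
    },
}


def get_labels(detected_language: str) -> dict:
    """Return the label set for a language, falling back to English (assumed default)."""
    return LABELS.get(detected_language, LABELS["English"])


# Example: pick the heading text for the timeline section.
timeline_heading = get_labels("Arabic")["timeline"]
```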

Testing

from src.summarization.note_generator import NoteGenerator

generator = NoteGenerator()
transcript = "Here is the complete video transcript..."
title = "Introduction to Python"

# Generate notes
summary_json = generator.generateSummary(transcript, title)
notes_md = generator.format_notes_to_markdown(summary_json)

print(notes_md)

Libraries Used

groq — Communicate with Groq API (Llama-3.3-70b-versatile).
pydantic — Data validation and schema enforcement.
