amanyelfiky's picture
Refactor summarization module and update schemas
3c66ec8

Summarization Module ๐Ÿ“

Responsibility

This module handles text summarization and conversion to structured study notes.

Functionality

  1. Receive transcribed text from videos.
  2. Use Groq (Llama-3.3-70b-versatile) to analyze text and generate structured JSON notes.
  3. Produce clean Markdown output with:
    • Source & Duration header
    • Overall Summary
    • Chronological Timeline (3-7 segments with Key Insight + Why It Matters)
    • Conclusion

Files

1. schemas.py

  • Purpose: Single source of truth for all Pydantic data models.
  • Key Classes:
    • SummarySchema โ€” Full structured output (title, detected_language, summary, segments, conclusion, topics).
    • SegmentSchema โ€” A timeline section (title, summary, key_insight, why_it_matters).

2. note_generator.py

  • Purpose: Generate notes using Groq AI with strict JSON enforcement.
  • Main Class: NoteGenerator
  • Key Methods:
    • generateSummary(transcript, title) โ€” Generates structured JSON study notes.
    • format_notes_to_markdown(json_notes) โ€” Converts JSON to clean Markdown.
    • format_final_notes(notes, title, url, duration) โ€” Wraps Markdown with Source/Duration header.

3. segmenter.py

  • Purpose: Split long texts into smaller segments for preprocessing.
  • Main Class: TranscriptSegmenter
  • Key Methods:
    • segment_by_time() โ€” Split by time intervals.
    • clean_text() โ€” Remove filler words.

JSON Output Structure

{
    "title": "...",
    "detected_language": "English",
    "summary": "Overall summary (3-5 sentences)",
    "segments": [
        {
            "title": "Segment title",
            "summary": "What this section covers",
            "key_insight": "Most important point",
            "why_it_matters": "Why this is valuable"
        }
    ],
    "conclusion": "Final takeaway",
    "topics": ["Topic1", "Topic2"]
}

Note: topics is hidden metadata โ€” not rendered in markdown, used by downstream modules only.

Markdown Output Order

  1. Source โ€” video URL
  2. Duration โ€” video length
  3. Overall Summary โ€” one concise summary
  4. Timeline โ€” chronological segments (3-7), each with Key Insight + Why It Matters
  5. Conclusion โ€” final takeaway

Labels (Localized)

Key English Arabic
source Source ุงู„ู…ุตุฏุฑ
duration Duration ุงู„ู…ุฏุฉ
summary Overall Summary ุงู„ู…ู„ุฎุต ุงู„ุนุงู…
timeline Timeline ุงู„ุชุณู„ุณู„ ุงู„ุฒู…ู†ูŠ
insight Key Insight ุฃู‡ู… ู†ู‚ุทุฉ
why Why It Matters ู„ู…ุงุฐุง ูŠู‡ู…ุŸ
conclusion Conclusion ุงู„ุฎู„ุงุตุฉ

Testing

from src.summarization.note_generator import NoteGenerator

generator = NoteGenerator()
transcript = "Here is the complete video transcript..."
title = "Introduction to Python"

# Generate notes
summary_json = generator.generateSummary(transcript, title)
notes_md = generator.format_notes_to_markdown(summary_json)

print(notes_md)

Libraries Used

  • groq โ€” Communicate with Groq API (Llama-3.3-70b-versatile).
  • pydantic โ€” Data validation and schema enforcement.