Ahmed Mostafa
server v2
6405808

Categorization Module 🏷️

Responsibility

This module automatically assigns a category to each note, based on the note's summary text.

Functionality

  1. Receive summary text.
  2. Use Google Gemini to analyze content.
  3. Return a single category (e.g., Programming, Medicine, History).

Files

1. categorizer.py

  • Purpose: Categorize text using AI.
  • Main Class: CategorizationService
  • Key Method: categorize_text(text) - Coroutine (call it with await); returns the category name.

How It Works

  1. Receive Text: Take the first 2000 characters of the summary.
  2. Send Prompt: Ask Gemini for a one- or two-word category.
  3. Clean Result: Remove periods and capitalize the first letter.
  4. Validate: If the result is longer than 30 characters, truncate it.
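
The four steps above can be sketched roughly as follows. This is a minimal illustration, not the actual code in categorizer.py: the names clean_category, categorize, and ask_model are hypothetical, the prompt wording is assumed, and the model call is injected as a callable so the flow can be exercised without a Gemini API key. The 10-character minimum and the "Uncategorized" fallback come from the notes elsewhere in this document.

```python
import asyncio
from typing import Awaitable, Callable

MAX_INPUT_CHARS = 2000     # step 1: only the first 2000 characters are sent
MIN_INPUT_CHARS = 10       # shorter inputs are not sent to the model
MAX_CATEGORY_CHARS = 30    # step 4: longer answers are truncated
FALLBACK = "Uncategorized"


def clean_category(raw: str) -> str:
    """Steps 3 and 4: drop periods, capitalize the first letter, cap the length."""
    cleaned = raw.replace(".", "").strip()
    if not cleaned:
        return FALLBACK
    cleaned = cleaned[0].upper() + cleaned[1:]
    return cleaned[:MAX_CATEGORY_CHARS].rstrip()


async def categorize(text: str, ask_model: Callable[[str], Awaitable[str]]) -> str:
    """Steps 1 to 4. `ask_model` is a hypothetical stand-in for the Gemini call."""
    if len(text.strip()) < MIN_INPUT_CHARS:  # assumption: whitespace is ignored
        return FALLBACK
    snippet = text[:MAX_INPUT_CHARS]
    prompt = (
        "Determine a one- or two-word category for the following text. "
        "Return ONLY the category name.\n\n"
        f"Text: {snippet}\n\nCategory:"
    )
    try:
        raw = await ask_model(prompt)
    except Exception:
        return FALLBACK  # categorization failed
    return clean_category(raw)
```

Injecting the model call keeps the cleaning and validation logic easy to test on its own; in the real service the callable would wrap the Gemini request.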

Category Examples

  • Programming - Coding and development tutorials.
  • Medicine - Health and medical content.
  • Business - Business management and entrepreneurship.
  • Science - Physics, chemistry, biology.
  • History - Historical events and civilizations.
  • Personal Development - Self-improvement content.
  • Uncategorized - If categorization fails.

Proposed Enhancements

  • Add predefined list of allowed categories.
  • Use embeddings to improve categorization accuracy.
  • Add support for sub-categories.
  • Store categorization results in database for future analysis.
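
As an illustration of the first enhancement, the model's answer could be checked against a fixed list before it is returned. The sketch below is only one possible approach: ALLOWED_CATEGORIES and to_allowed_category are hypothetical names, and the list simply reuses the category examples from this document.

```python
# Hypothetical allowed list, taken from the "Category Examples" section.
ALLOWED_CATEGORIES = (
    "Programming",
    "Medicine",
    "Business",
    "Science",
    "History",
    "Personal Development",
)
FALLBACK = "Uncategorized"


def to_allowed_category(candidate: str, allowed=ALLOWED_CATEGORIES) -> str:
    """Map a model answer onto the allowed list, ignoring case and extra spaces."""
    key = " ".join(candidate.split()).casefold()
    for name in allowed:
        if name.casefold() == key:
            return name  # return the canonical spelling from the list
    return FALLBACK  # anything outside the list is treated as a failure
```

With this check, an answer such as "personal development" is normalized to "Personal Development", and an unexpected answer such as "Cooking" falls back to "Uncategorized".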

Testing

import asyncio

from src.ai_modules.categorization.categorizer import CategorizationService

async def main():
    categorizer = CategorizationService()

    # Categorize text (categorize_text must be awaited inside a coroutine)
    text = "This video explains how to build a REST API using FastAPI and Python..."
    category = await categorizer.categorize_text(text)

    print(f"Category: {category}")  # e.g. Category: Programming

asyncio.run(main())

Libraries Used

  • google-genai - Communicate with Google Gemini.

Important Notes

  • Currently uses the gemini-1.5-flash model.
  • If the text is too short (<10 chars), the method returns "Uncategorized".
  • Accuracy can be improved by adding examples to the prompt.

Improving the Prompt

To improve categorization accuracy, modify the prompt in categorizer.py, for example by restricting the answer to a fixed list of categories:

prompt = (
    "Analyze the following text and categorize it into ONE of these categories: "
    "Programming, Medicine, Business, Science, History, Personal Development, Education, Technology. "
    "Return ONLY the category name.\n\n"
    f"Text: {text[:2000]}\n\n"
    "Category:"
)
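
The Important Notes also suggest adding examples to the prompt. A minimal sketch of that idea is shown here, assuming a hypothetical helper named build_prompt that takes (text, category) pairs as few-shot examples; the instruction wording and the 2000-character limit are copied from the prompt above.

```python
from typing import Sequence, Tuple

CATEGORIES = (
    "Programming", "Medicine", "Business", "Science",
    "History", "Personal Development", "Education", "Technology",
)


def build_prompt(text: str, examples: Sequence[Tuple[str, str]] = ()) -> str:
    """Build the categorization prompt, optionally preceded by labeled examples."""
    lines = [
        "Analyze the following text and categorize it into ONE of these categories: "
        + ", ".join(CATEGORIES) + ".",
        "Return ONLY the category name.",
        "",
    ]
    # Each example is rendered in the same Text/Category layout as the real query.
    for sample, category in examples:
        lines += [f"Text: {sample}", f"Category: {category}", ""]
    lines += [f"Text: {text[:2000]}", "", "Category:"]
    return "\n".join(lines)
```

Keeping the examples in the same layout as the final query makes it more likely that the model completes the last "Category:" line in the same format.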