================================================================================
                    MODELS DOCUMENTATION - Solar Project
================================================================================
Generated on: February 13, 2026

This document provides a comprehensive overview of all Django models used in 
the solar_project codebase, including their purpose and field definitions.

================================================================================

MODEL 1: Page
--------------------------------------------------------------------------------
Location: solar_api/models.py
Database Table: pages

DESCRIPTION:
    Represents a web page (URL) that has been crawled and indexed, typically
    for RAG (Retrieval-Augmented Generation). Each record tracks which URL
    was processed and its current status.

FIELDS:
    1. id (AutoField - Primary Key)
       - Automatically generated unique identifier
       - Type: Integer
       - Auto-increment

    2. url (TextField)
       - The complete URL of the indexed page
       - Type: Text (unlimited length)
       - Unique: Yes
       - Indexed: Yes (for fast lookups)
       - Purpose: Stores the web page URL that was crawled

    3. tenant_id (TextField)
       - Identifier for multi-tenant support
       - Type: Text
       - Indexed: Yes
       - Purpose: Allows multiple tenants/organizations to use the system
                 with isolated data

    4. content_hash (TextField)
       - Hash of the page content
       - Type: Text
       - Purpose: Used to detect if page content has changed since last crawl
                 (for efficient re-indexing)

    5. is_active (BooleanField)
       - Indicates if the page is currently active/valid
       - Type: Boolean (True/False)
       - Default: True
       - Indexed: Yes
       - Purpose: Allows soft-deletion or deactivation of pages without
                 removing them from the database

    6. last_indexed (DateTimeField)
       - Timestamp of when the page was last indexed
       - Type: DateTime
       - Default: Current time (timezone.now)
       - Purpose: Track freshness of indexed content

INDEXES:
    - Composite index on (tenant_id, is_active) for efficient tenant queries
    - Index on url field
    - Index on is_active field

================================================================================

MODEL 2: Document
--------------------------------------------------------------------------------
Location: solar_api/models.py
Database Table: documents

DESCRIPTION:
    Represents one chunk of text extracted from a page, stored together with
    its vector embedding for semantic search and RAG (Retrieval-Augmented
    Generation) operations.

FIELDS:
    1. id (AutoField - Primary Key)
       - Automatically generated unique identifier
       - Type: Integer
       - Auto-increment

    2. content (TextField)
       - The actual text content of the document chunk
       - Type: Text (unlimited length)
       - Purpose: Stores the chunked text that will be used for retrieval
                 and context generation

    3. source (TextField)
       - Source information about where the content came from
       - Type: Text
       - Purpose: Track the origin of the document (e.g., filename, URL)

    4. page_url (TextField)
       - URL of the page this document chunk belongs to
       - Type: Text
       - Indexed: Yes
       - Purpose: Link the document chunk back to its source page
                 (relates to the Page model)

    5. embedding (TextField)
       - Vector embedding of the document content
       - Type: Text (stored as JSON array)
       - Purpose: Stores the 768-dimensional vector representation of the
                 content for semantic similarity searches
       - Note: Designed for PostgreSQL's pgvector extension (vector(768)),
               but currently stored as a JSON array for compatibility

    6. hash (TextField)
       - Unique hash of the document content
       - Type: Text
       - Unique: Yes
       - Indexed: Yes
       - Purpose: Prevent duplicate document chunks from being stored
                 and enable fast duplicate detection

INDEXES:
    - Index on page_url field (for fast page-based queries)
    - Index on hash field (for duplicate detection)
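
EXAMPLE (illustrative sketch):
    The hash field lets the indexer skip chunks that are already stored. This
    document does not say which hash function is used, so SHA-256 and the
    whitespace normalization below are assumptions, and chunk_hash and
    filter_new_chunks are hypothetical helper names.

```python
import hashlib


def chunk_hash(content: str) -> str:
    """Return a hex digest for a chunk (assumed: SHA-256 of normalized text)."""
    # Assumption: collapse whitespace so trivially different copies collide.
    normalized = " ".join(content.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def filter_new_chunks(chunks, existing_hashes):
    """Return (hash, chunk) pairs whose hash is not already known."""
    seen = set(existing_hashes)
    fresh = []
    for chunk in chunks:
        digest = chunk_hash(chunk)
        if digest in seen:
            continue  # duplicate of a stored or earlier chunk
        seen.add(digest)
        fresh.append((digest, chunk))
    return fresh
```

    In a database-backed version, existing_hashes would come from a lookup on
    the indexed hash column, and the unique constraint acts as a final guard.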

SPECIAL NOTES:
    - The embedding field is designed to work with PostgreSQL's pgvector
      extension which provides efficient vector similarity search
    - 768 dimensions is a common output size for embedding models built on
      BERT-base-sized encoders (e.g., all-mpnet-base-v2 from
      sentence-transformers)
    - Raw SQL may be used for vector operations (cosine similarity, etc.)

================================================================================

RELATIONSHIPS BETWEEN MODELS:
--------------------------------------------------------------------------------
    Page (1) ---> (many) Document
    
    - One Page can have multiple Documents (One-to-Many relationship)
    - Documents are linked to Pages via the page_url field
    - This is a logical relationship (not enforced by ForeignKey in the code)
    - When a page is crawled, its content is split into chunks, and each
      chunk becomes a Document with a reference to the parent Page's URL
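
EXAMPLE (illustrative sketch):
    The logical one-to-many link can be pictured with plain Python objects.
    The fixed-size character chunking, the 500-character default, and the
    names DocumentChunk, chunks_for_page and group_by_page are assumptions for
    illustration; this document does not describe the real chunking strategy.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class DocumentChunk:
    content: str
    source: str
    page_url: str  # logical reference to the parent Page's url (no ForeignKey)


def chunks_for_page(page_url: str, text: str, chunk_size: int = 500):
    """Split page text into fixed-size chunks that all point at page_url."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        DocumentChunk(content=text[i:i + chunk_size], source=page_url, page_url=page_url)
        for i in range(0, len(text), chunk_size)
    ]


def group_by_page(chunks):
    """Rebuild the Page -> Documents view by grouping on page_url."""
    grouped = defaultdict(list)
    for chunk in chunks:
        grouped[chunk.page_url].append(chunk)
    return dict(grouped)
```

    Since nothing enforces the link at the database level, deleting or
    renaming a Page leaves its Documents in place unless the application
    cleans them up by page_url.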

================================================================================

COMMON USE CASES:
--------------------------------------------------------------------------------
    1. Web Crawling & Indexing:
       - Create Page records for discovered URLs
       - Extract content and create Document chunks
       - Store embeddings for semantic search

    2. RAG (Retrieval-Augmented Generation):
       - Query Documents using vector similarity
       - Retrieve relevant context for chatbot responses
       - Use page_url to trace back to original sources

    3. Multi-Tenant Support:
       - Filter Pages by tenant_id
       - Each tenant has an isolated set of pages
       - Document has no tenant_id field, so documents are scoped to a
         tenant through page_url

    4. Content Freshness:
       - Check last_indexed to determine if re-indexing is needed
       - Compare content_hash to detect changes

    5. Deduplication:
       - Use Document.hash to prevent storing duplicate chunks
       - Use Page.content_hash to detect page changes

================================================================================
                              END OF DOCUMENTATION
================================================================================