================================================================================
MODELS DOCUMENTATION - Solar Project
================================================================================
Generated on: February 13, 2026
This document provides a comprehensive overview of all Django models used in
the solar_project codebase, including their purpose and field definitions.
================================================================================
MODEL 1: Page
--------------------------------------------------------------------------------
Location: solar_api/models.py
Database Table: pages
DESCRIPTION:
Represents a web page (URL) that has been crawled and indexed, typically
for RAG (Retrieval-Augmented Generation) functionality. It records which
URLs have been processed and their current status.
FIELDS:
1. id (AutoField - Primary Key)
- Automatically generated unique identifier
- Type: Integer
- Auto-increment
2. url (TextField)
- The complete URL of the indexed page
- Type: Text (unlimited length)
- Unique: Yes
- Indexed: Yes (for fast lookups)
- Purpose: Stores the web page URL that was crawled
3. tenant_id (TextField)
- Identifier for multi-tenant support
- Type: Text
- Indexed: Yes
- Purpose: Allows multiple tenants/organizations to use the system
with isolated data
4. content_hash (TextField)
- Hash of the page content
- Type: Text
- Purpose: Used to detect if page content has changed since last crawl
(for efficient re-indexing)
5. is_active (BooleanField)
- Indicates if the page is currently active/valid
- Type: Boolean (True/False)
- Default: True
- Indexed: Yes
- Purpose: Allows soft-deletion or deactivation of pages without
removing them from the database
6. last_indexed (DateTimeField)
- Timestamp of when the page was last indexed
- Type: DateTime
- Default: Current time (timezone.now)
- Purpose: Track freshness of indexed content
INDEXES:
- Composite index on (tenant_id, is_active) for efficient tenant queries
- Index on url field
- Index on is_active field
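EXAMPLE (illustrative sketch):
The content_hash and last_indexed fields together support a simple
"should this page be re-indexed?" check. The sketch below is plain Python,
independent of the ORM. The function names, the choice of SHA-256, and the
7-day maximum age are assumptions made for illustration; this document does
not state which hash algorithm or refresh policy the project uses.

```python
import hashlib
from datetime import datetime, timedelta, timezone


def compute_content_hash(text: str) -> str:
    """Return a hex digest of the page text.

    SHA-256 is an assumption; any stable hash would serve the same purpose.
    """
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def needs_reindex(stored_hash, last_indexed, new_text,
                  max_age=timedelta(days=7), now=None):
    """Decide whether a page should be re-indexed.

    stored_hash  -- value previously saved in Page.content_hash
    last_indexed -- timezone-aware datetime from Page.last_indexed
    new_text     -- freshly crawled page text
    max_age      -- hypothetical staleness limit (not from the source)
    """
    now = now or datetime.now(timezone.utc)
    # Content changed since the last crawl: always re-index.
    if stored_hash != compute_content_hash(new_text):
        return True
    # Content unchanged: re-index only if the record is older than max_age.
    return now - last_indexed > max_age
```

A crawler could call needs_reindex() before re-chunking a page and skip
the embedding step entirely when it returns False.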
================================================================================
MODEL 2: Document
--------------------------------------------------------------------------------
Location: solar_api/models.py
Database Table: documents
DESCRIPTION:
Represents a document chunk together with its vector embedding. Each
Document is a piece of text extracted from a page and stored with its
embedding so that it can be retrieved by semantic similarity for RAG
(Retrieval-Augmented Generation) operations.
FIELDS:
1. id (AutoField - Primary Key)
- Automatically generated unique identifier
- Type: Integer
- Auto-increment
2. content (TextField)
- The actual text content of the document chunk
- Type: Text (unlimited length)
- Purpose: Stores the chunked text that will be used for retrieval
and context generation
3. source (TextField)
- Source information about where the content came from
- Type: Text
- Purpose: Track the origin of the document (e.g., filename, URL)
4. page_url (TextField)
- URL of the page this document chunk belongs to
- Type: Text
- Indexed: Yes
- Purpose: Link the document chunk back to its source page
(relates to the Page model)
5. embedding (TextField)
- Vector embedding of the document content
- Type: Text (stored as JSON array)
- Purpose: Stores the 768-dimensional vector representation of the
content for semantic similarity searches
- Note: Designed for PostgreSQL's pgvector extension (vector(768));
currently stored as a JSON array for compatibility
6. hash (TextField)
- Unique hash of the document content
- Type: Text
- Unique: Yes
- Indexed: Yes
- Purpose: Prevent duplicate document chunks from being stored
and enable fast duplicate detection
INDEXES:
- Index on page_url field (for fast page-based queries)
- Index on hash field (for duplicate detection)
SPECIAL NOTES:
- The embedding field is designed to work with PostgreSQL's pgvector
extension which provides efficient vector similarity search
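- 768 dimensions is the output size of BERT-base-sized embedding models
(e.g., the sentence-transformers model all-mpnet-base-v2); other models
produce different sizes (e.g., 384), so the stored dimension must match
the embedding model in use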
- Raw SQL may be used for vector operations (cosine similarity, etc.)
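EXAMPLE (illustrative sketch):
Because the embedding is currently stored as a JSON array in a text column,
similarity can be computed in application code after decoding. The sketch
below shows one way to round-trip a vector through JSON and rank rows by
cosine similarity. All function names are hypothetical, and the linear scan
is a stand-in for illustration only, not the project's retrieval code.

```python
import json
import math


def encode_embedding(vector):
    """Serialize a vector to the JSON-array text form stored in the column."""
    return json.dumps([float(x) for x in vector])


def decode_embedding(text):
    """Parse the stored JSON-array text back into a list of floats."""
    return [float(x) for x in json.loads(text)]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, rows, k=3):
    """Rank (doc_id, embedding_json) rows by similarity to query_vector.

    Returns up to k (doc_id, score) pairs, best match first.
    """
    scored = [
        (doc_id, cosine_similarity(query_vector, decode_embedding(emb_json)))
        for doc_id, emb_json in rows
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

This scan decodes every row on each query, so its cost grows linearly with
the number of Documents. With a native pgvector column the same ranking
could be pushed into the database (pgvector provides a cosine-distance
operator, <=>), which is presumably what the raw SQL note above refers to.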
================================================================================
RELATIONSHIPS BETWEEN MODELS:
--------------------------------------------------------------------------------
Page <---> Document
- One Page can have multiple Documents (One-to-Many relationship)
- Documents are linked to Pages via the page_url field
- This is a logical relationship (not enforced by ForeignKey in the code)
- When a page is crawled, its content is split into chunks, and each
chunk becomes a Document with a reference to the parent Page's URL
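EXAMPLE (illustrative sketch):
Since the link between the two tables is only the shared URL value, queries
have to join on Page.url = Document.page_url explicitly, and nothing stops
a Document from pointing at a URL that has no Page row. The sketch below
reproduces a reduced version of the two tables in an in-memory SQLite
database to show both points. The SQLite schema, the sample rows, and the
function names are assumptions for illustration; only the table and column
names come from this document.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pages (
    id        INTEGER PRIMARY KEY AUTOINCREMENT,
    url       TEXT NOT NULL UNIQUE,
    tenant_id TEXT NOT NULL,
    is_active INTEGER NOT NULL DEFAULT 1
);
CREATE TABLE documents (
    id       INTEGER PRIMARY KEY AUTOINCREMENT,
    content  TEXT NOT NULL,
    page_url TEXT NOT NULL,       -- logical link only, no FOREIGN KEY
    hash     TEXT NOT NULL UNIQUE
);
CREATE INDEX documents_page_url_idx ON documents (page_url);
""")
conn.executemany(
    "INSERT INTO pages (url, tenant_id, is_active) VALUES (?, ?, ?)",
    [
        ("https://example.com/a", "tenant-a", 1),
        ("https://example.com/old", "tenant-a", 0),
        ("https://example.com/b", "tenant-b", 1),
    ],
)
conn.executemany(
    "INSERT INTO documents (content, page_url, hash) VALUES (?, ?, ?)",
    [
        ("chunk 1", "https://example.com/a", "h1"),
        ("chunk 2", "https://example.com/a", "h2"),
        ("stale chunk", "https://example.com/old", "h3"),
        ("other tenant", "https://example.com/b", "h4"),
        ("orphan", "https://example.com/missing", "h5"),
    ],
)


def chunks_for_tenant(conn, tenant_id):
    """Return (page_url, content) for chunks on a tenant's active pages."""
    return conn.execute(
        """
        SELECT d.page_url, d.content
        FROM documents AS d
        JOIN pages AS p ON p.url = d.page_url
        WHERE p.tenant_id = ? AND p.is_active = 1
        ORDER BY d.id
        """,
        (tenant_id,),
    ).fetchall()


def orphaned_page_urls(conn):
    """Return page_url values in documents that match no row in pages."""
    rows = conn.execute(
        """
        SELECT DISTINCT d.page_url
        FROM documents AS d
        LEFT JOIN pages AS p ON p.url = d.page_url
        WHERE p.id IS NULL
        """
    ).fetchall()
    return [url for (url,) in rows]
```

A periodic check along the lines of orphaned_page_urls() is one way to
compensate for the missing database-level constraint.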
================================================================================
COMMON USE CASES:
--------------------------------------------------------------------------------
1. Web Crawling & Indexing:
- Create Page records for discovered URLs
- Extract content and create Document chunks
- Store embeddings for semantic search
2. RAG (Retrieval-Augmented Generation):
- Query Documents using vector similarity
- Retrieve relevant context for chatbot responses
- Use page_url to trace back to original sources
3. Multi-Tenant Support:
- Filter Pages by tenant_id
- Each tenant has an isolated set of pages; Document has no tenant_id
field, so documents are scoped to a tenant through their page_url
4. Content Freshness:
- Check last_indexed to determine if re-indexing is needed
- Compare content_hash to detect changes
5. Deduplication:
- Use Document.hash to prevent storing duplicate chunks
- Use Page.content_hash to detect page changes
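EXAMPLE (illustrative sketch):
The indexing and deduplication use cases above can be sketched as a small
pipeline: split the page text into chunks, hash each chunk, and keep only
chunks whose hash has not been seen. The fixed-size chunking, the SHA-256
hash, the use of the URL as the source value, and all function names are
assumptions for illustration; this document does not describe the actual
chunking strategy or hash algorithm.

```python
import hashlib


def chunk_text(text, chunk_size=200):
    """Split text into consecutive fixed-size chunks (hypothetical strategy)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def chunk_hash(content):
    """Hex SHA-256 of the chunk text, standing in for Document.hash."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def build_documents(page_url, text, known_hashes, chunk_size=200):
    """Return dicts for new chunks only, mirroring the Document fields.

    known_hashes is a set of hashes already stored; it is updated in place
    so repeated chunks within the same page are also skipped.
    """
    documents = []
    for chunk in chunk_text(text, chunk_size):
        digest = chunk_hash(chunk)
        if digest in known_hashes:
            continue  # duplicate chunk: do not store it again
        known_hashes.add(digest)
        documents.append({
            "content": chunk,
            "source": page_url,    # assumption: source mirrors the URL here
            "page_url": page_url,
            "hash": digest,
        })
    return documents
```

In this sketch the hash covers only the chunk text, so identical text that
appears on two pages is stored once and stays attached to whichever page
was processed first. Whether the project's hash also covers the URL is not
stated in this document.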
================================================================================
END OF DOCUMENTATION
================================================================================