	================================================================================
	MODELS DOCUMENTATION - Solar Project
	================================================================================
	Generated on: February 13, 2026

	This document provides a comprehensive overview of all Django models used in
	the solar_project codebase, including their purpose and field definitions.

	================================================================================

	MODEL 1: Page
	--------------------------------------------------------------------------------
	Location: solar_api/models.py
	Database Table: pages

	DESCRIPTION:
	Represents a web page (URL) that has been crawled and indexed, typically
	for RAG (Retrieval-Augmented Generation) functionality. It records which
	URLs have been processed, their content hash, and whether they are still
	active.

	FIELDS:
	1. id (AutoField - Primary Key)
	- Automatically generated unique identifier
	- Type: Integer
	- Auto-increment

	2. url (TextField)
	- The complete URL of the indexed page
	- Type: Text (unlimited length)
	- Unique: Yes
	- Indexed: Yes (for fast lookups)
	- Purpose: Stores the web page URL that was crawled

	3. tenant_id (TextField)
	- Identifier for multi-tenant support
	- Type: Text
	- Indexed: Yes
	- Purpose: Allows multiple tenants/organizations to use the system
	with isolated data

	4. content_hash (TextField)
	- Hash of the page content
	- Type: Text
	- Purpose: Used to detect if page content has changed since last crawl
	(for efficient re-indexing)

	5. is_active (BooleanField)
	- Indicates if the page is currently active/valid
	- Type: Boolean (True/False)
	- Default: True
	- Indexed: Yes
	- Purpose: Allows soft-deletion or deactivation of pages without
	removing them from the database

	6. last_indexed (DateTimeField)
	- Timestamp of when the page was last indexed
	- Type: DateTime
	- Default: Current time (timezone.now)
	- Purpose: Track freshness of indexed content

	INDEXES:
	- Composite index on (tenant_id, is_active) for efficient tenant queries
	- Index on url field
	- Index on is_active field

	================================================================================

	MODEL 2: Document
	--------------------------------------------------------------------------------
	Location: solar_api/models.py
	Database Table: documents

	DESCRIPTION:
	Represents a chunk of text extracted from a page, stored together with its
	vector embedding for semantic search. Chunks are retrieved by similarity
	to supply context for RAG (Retrieval-Augmented Generation) operations.

	FIELDS:
	1. id (AutoField - Primary Key)
	- Automatically generated unique identifier
	- Type: Integer
	- Auto-increment

	2. content (TextField)
	- The actual text content of the document chunk
	- Type: Text (unlimited length)
	- Purpose: Stores the chunked text that will be used for retrieval
	and context generation

	3. source (TextField)
	- Source information about where the content came from
	- Type: Text
	- Purpose: Track the origin of the document (e.g., filename, URL)

	4. page_url (TextField)
	- URL of the page this document chunk belongs to
	- Type: Text
	- Indexed: Yes
	- Purpose: Link the document chunk back to its source page
	(relates to the Page model)

	5. embedding (TextField)
	- Vector embedding of the document content
	- Type: Text (stored as JSON array)
	- Purpose: Stores the 768-dimensional vector representation of the
	content for semantic similarity searches
	- Note: Designed for PostgreSQL's pgvector extension (vector(768));
	currently stored as a JSON array for compatibility

	6. hash (TextField)
	- Unique hash of the document content
	- Type: Text
	- Unique: Yes
	- Indexed: Yes
	- Purpose: Prevent duplicate document chunks from being stored
	and enable fast duplicate detection

	INDEXES:
	- Index on page_url field (for fast page-based queries)
	- Index on hash field (for duplicate detection)

	SPECIAL NOTES:
	- The embedding field is designed to work with PostgreSQL's pgvector
	extension which provides efficient vector similarity search
	- 768 dimensions is a common output size for embedding models
	(e.g., sentence-transformers models such as all-mpnet-base-v2)
	- Raw SQL may be used for vector operations (cosine similarity, etc.)
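
While embeddings are stored as JSON arrays rather than in a native vector column, similarity can be computed in application code. The following is a minimal sketch of that idea using only the standard library. The function names (parse_embedding, cosine_similarity, top_k) are hypothetical, and the vectors are 3-dimensional toys standing in for the 768 values described above. With pgvector enabled, the same ranking could instead be done in SQL, where the <=> operator computes cosine distance.

```python
import json
import math


def parse_embedding(raw):
    """Decode an embedding stored as a JSON array string into floats."""
    return [float(x) for x in json.loads(raw)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat as unrelated
    return dot / (norm_a * norm_b)


def top_k(query, rows, k=3):
    """Rank (doc_id, embedding_json) rows by similarity, best first."""
    scored = [
        (doc_id, cosine_similarity(query, parse_embedding(raw)))
        for doc_id, raw in rows
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy rows shaped like (Document.id, Document.embedding).
rows = [
    (1, json.dumps([1.0, 0.0, 0.0])),
    (2, json.dumps([0.0, 1.0, 0.0])),
    (3, json.dumps([0.9, 0.1, 0.0])),
]
results = top_k([1.0, 0.0, 0.0], rows, k=2)
print(results)
```

This scans every row, which is fine for small tables; an indexed pgvector column is the usual choice once the number of chunks grows.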

	================================================================================

	RELATIONSHIPS BETWEEN MODELS:
	--------------------------------------------------------------------------------
	Page <---> Document

	- One Page can have multiple Documents (One-to-Many relationship)
	- Documents are linked to Pages via the page_url field
	- This is a logical relationship (not enforced by a ForeignKey in the code),
	so the database does not guarantee that each page_url matches a Page row
	- When a page is crawled, its content is split into chunks, and each
	chunk becomes a Document with a reference to the parent Page's URL
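
The chunking step described above can be sketched as follows. This is an illustration only: the fixed-size character chunking, the SHA-256 hash, the DocumentRecord dataclass, and the function names are all assumptions, since this document does not state how the project splits or hashes content.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class DocumentRecord:
    """Plain stand-in for a Document row (embedding omitted for brevity)."""

    content: str
    source: str
    page_url: str
    hash: str


def chunk_text(text, chunk_size=200):
    """Split text into consecutive chunks of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def documents_for_page(page_url, text, chunk_size=200):
    """Build one DocumentRecord per chunk, each pointing back to page_url."""
    records = []
    for chunk in chunk_text(text, chunk_size):
        digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        records.append(
            DocumentRecord(
                content=chunk, source=page_url, page_url=page_url, hash=digest
            )
        )
    return records


page_url = "https://example.com/solar"
page_text = " ".join(f"word{i}" for i in range(100))
docs = documents_for_page(page_url, page_text, chunk_size=200)
print(len(docs), "chunks for", page_url)
```

Because page_url is a plain text column, finding the chunks of a page is a filter on that value rather than a join through a foreign key.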

	================================================================================

	COMMON USE CASES:
	--------------------------------------------------------------------------------
	1. Web Crawling & Indexing:
	- Create Page records for discovered URLs
	- Extract content and create Document chunks
	- Store embeddings for semantic search

	2. RAG (Retrieval-Augmented Generation):
	- Query Documents using vector similarity
	- Retrieve relevant context for chatbot responses
	- Use page_url to trace back to original sources

	3. Multi-Tenant Support:
	- Filter Pages by tenant_id
	- Each tenant has an isolated set of pages; Documents carry no tenant_id,
	so they are scoped to a tenant only through their page_url

	4. Content Freshness:
	- Check last_indexed to determine if re-indexing is needed
	- Compare content_hash to detect changes

	5. Deduplication:
	- Use Document.hash to prevent storing duplicate chunks
	- Use Page.content_hash to detect page changes
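
Use cases 4 and 5 can be sketched with a few small helpers. This is a hedged illustration, not the project's code: the SHA-256 hash, the 7-day staleness threshold, and the function names (is_stale, content_changed, new_chunks) are assumptions chosen for the example.

```python
import hashlib
from datetime import datetime, timedelta, timezone


def sha256_hex(text):
    """Hex digest used here for both Page.content_hash and Document.hash."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def is_stale(last_indexed, now, max_age=timedelta(days=7)):
    """True if the page was last indexed longer ago than max_age."""
    return now - last_indexed > max_age


def content_changed(stored_hash, fetched_content):
    """True if freshly fetched content no longer matches the stored hash."""
    return sha256_hex(fetched_content) != stored_hash


def new_chunks(chunks, existing_hashes):
    """Return (hash, chunk) pairs whose hash is not already stored."""
    seen = set(existing_hashes)
    fresh = []
    for chunk in chunks:
        digest = sha256_hex(chunk)
        if digest not in seen:
            seen.add(digest)  # also drops duplicates within this batch
            fresh.append((digest, chunk))
    return fresh


now = datetime(2026, 2, 13, tzinfo=timezone.utc)
last_indexed = now - timedelta(days=10)
stored_hash = sha256_hex("old page text")

if is_stale(last_indexed, now) and content_changed(stored_hash, "new page text"):
    to_store = new_chunks(["chunk a", "chunk b", "chunk a"], {sha256_hex("chunk b")})
    print([chunk for _, chunk in to_store])
```

Checking the hash before re-embedding avoids recomputing embeddings for pages whose content is unchanged, and the unique constraint on Document.hash remains the final guard against duplicates.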

	================================================================================
	END OF DOCUMENTATION
	================================================================================