================================================================================
MODELS DOCUMENTATION - Solar Project
================================================================================
Generated on: February 13, 2026
This document provides a comprehensive overview of all Django models used in
the solar_project codebase, including their purpose and field definitions.
================================================================================
MODEL 1: Page
--------------------------------------------------------------------------------
Location: solar_api/models.py
Database Table: pages
DESCRIPTION:
Represents a web page (URL) that has been crawled and indexed, typically
for RAG (Retrieval-Augmented Generation) functionality. It records which
URLs have been processed and their current status.
FIELDS:
1. id (AutoField - Primary Key)
- Automatically generated unique identifier
- Type: Integer
- Auto-increment
2. url (TextField)
- The complete URL of the indexed page
- Type: Text (unlimited length)
- Unique: Yes
- Indexed: Yes (for fast lookups)
- Purpose: Stores the web page URL that was crawled
3. tenant_id (TextField)
- Identifier for multi-tenant support
- Type: Text
- Indexed: Yes
- Purpose: Allows multiple tenants/organizations to use the system
with isolated data
4. content_hash (TextField)
- Hash of the page content
- Type: Text
- Purpose: Used to detect if page content has changed since last crawl
(for efficient re-indexing)
5. is_active (BooleanField)
- Indicates if the page is currently active/valid
- Type: Boolean (True/False)
- Default: True
- Indexed: Yes
- Purpose: Allows soft-deletion or deactivation of pages without
removing them from the database
6. last_indexed (DateTimeField)
- Timestamp of when the page was last indexed
- Type: DateTime
- Default: Current time (timezone.now)
- Purpose: Track freshness of indexed content
INDEXES:
- Composite index on (tenant_id, is_active) for efficient tenant queries
- Index on url field
- Index on is_active field
================================================================================
MODEL 2: Document
--------------------------------------------------------------------------------
Location: solar_api/models.py
Database Table: documents
DESCRIPTION:
Represents a document chunk together with its embedding. Each document is
a piece of text extracted from a page, stored with its vector
representation so it can be found by semantic search in RAG
(Retrieval-Augmented Generation) operations.
FIELDS:
1. id (AutoField - Primary Key)
- Automatically generated unique identifier
- Type: Integer
- Auto-increment
2. content (TextField)
- The actual text content of the document chunk
- Type: Text (unlimited length)
- Purpose: Stores the chunked text that will be used for retrieval
and context generation
3. source (TextField)
- Source information about where the content came from
- Type: Text
- Purpose: Track the origin of the document (e.g., filename, URL)
4. page_url (TextField)
- URL of the page this document chunk belongs to
- Type: Text
- Indexed: Yes
- Purpose: Link the document chunk back to its source page
(relates to the Page model)
5. embedding (TextField)
- Vector embedding of the document content
- Type: Text (stored as JSON array)
- Purpose: Stores the 768-dimensional vector representation of the
content for semantic similarity searches
- Note: Designed for PostgreSQL's pgvector extension (vector(768));
currently stored as a JSON array for compatibility
6. hash (TextField)
- Unique hash of the document content
- Type: Text
- Unique: Yes
- Indexed: Yes
- Purpose: Prevent duplicate document chunks from being stored
and enable fast duplicate detection
INDEXES:
- Index on page_url field (for fast page-based queries)
- Index on hash field (for duplicate detection)
SPECIAL NOTES:
- The embedding field is designed to work with PostgreSQL's pgvector
extension which provides efficient vector similarity search
- 768 dimensions is a common embedding size (e.g., BERT-base-derived
sentence-transformers models such as all-mpnet-base-v2)
- Raw SQL may be used for vector operations (cosine similarity, etc.)
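While embeddings are stored as JSON text rather than a pgvector column,
similarity has to be computed outside the database. The following is a
minimal pure-Python sketch of that fallback. The function names and the
(id, embedding_json) row shape are hypothetical, and a pgvector-backed
deployment would normally push this work into SQL instead.

```python
import json
import math


def serialize_embedding(vector):
    """Encode a vector as the JSON array text kept in Document.embedding."""
    return json.dumps([float(x) for x in vector])


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_documents(query_vector, rows, top_k=3):
    """Return the top_k (doc_id, score) pairs, best match first.

    rows is an iterable of (doc_id, embedding_json) pairs.
    """
    scored = [
        (doc_id, cosine_similarity(query_vector, json.loads(embedding_json)))
        for doc_id, embedding_json in rows
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

This scans every row, which is fine for small tables but is the main
reason to move to a vector(768) column with an index once data grows.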
================================================================================
RELATIONSHIPS BETWEEN MODELS:
--------------------------------------------------------------------------------
Page <---> Document
- One Page can have multiple Documents (One-to-Many relationship)
- Documents are linked to Pages via the page_url field
- This is a logical relationship (not enforced by ForeignKey in the code)
- When a page is crawled, its content is split into chunks, and each
chunk becomes a Document with a reference to the parent Page's URL
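The chunk-to-page link described above can be sketched as follows. This is
an illustration only: fixed-width character chunking, the 500-character
default, and SHA-256 as the hash are all assumptions, and chunk_page is a
hypothetical helper rather than a function from the codebase.

```python
import hashlib


def chunk_page(page_url, text, chunk_size=500):
    """Split page text into Document-shaped dicts that point back at the page.

    Each dict mirrors the Document fields content, source, page_url and hash.
    The embedding is left out because it depends on the embedding model.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    documents = []
    for start in range(0, len(text), chunk_size):
        content = text[start:start + chunk_size]
        documents.append({
            "content": content,
            "source": page_url,
            "page_url": page_url,  # logical link to Page.url, no ForeignKey
            "hash": hashlib.sha256(content.encode("utf-8")).hexdigest(),
        })
    return documents
```

Because the link is only a shared string, nothing stops a Document from
outliving its Page, so any cleanup has to be done explicitly by page_url.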
================================================================================
COMMON USE CASES:
--------------------------------------------------------------------------------
1. Web Crawling & Indexing:
- Create Page records for discovered URLs
- Extract content and create Document chunks
- Store embeddings for semantic search
2. RAG (Retrieval-Augmented Generation):
- Query Documents using vector similarity
- Retrieve relevant context for chatbot responses
- Use page_url to trace back to original sources
3. Multi-Tenant Support:
- Filter Pages by tenant_id
- Each tenant has an isolated set of pages; documents are scoped
through page_url, since Document has no tenant_id field
4. Content Freshness:
- Check last_indexed to determine if re-indexing is needed
- Compare content_hash to detect changes
5. Deduplication:
- Use Document.hash to prevent storing duplicate chunks
- Use Page.content_hash to detect page changes
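Use cases 4 and 5 reduce to three small checks, sketched below. The helper
names, the SHA-256 hash, and the idea of a max_age threshold are
assumptions made for illustration; the document does not say which hash or
refresh policy the project uses.

```python
import hashlib
from datetime import datetime, timedelta, timezone


def content_hash(text):
    """Hex digest used here for both Page.content_hash and Document.hash."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def is_stale(last_indexed, max_age, now=None):
    """True when the page was last indexed longer ago than max_age."""
    now = now or datetime.now(timezone.utc)
    return now - last_indexed > max_age


def has_changed(stored_hash, fetched_text):
    """True when freshly fetched text no longer matches the stored hash."""
    return content_hash(fetched_text) != stored_hash


def new_chunks_only(chunks, existing_hashes):
    """Drop chunks whose hash is already stored or repeated in this batch."""
    seen = set(existing_hashes)
    fresh = []
    for chunk in chunks:
        digest = content_hash(chunk)
        if digest not in seen:
            seen.add(digest)
            fresh.append(chunk)
    return fresh
```

A crawler could call is_stale to decide whether to fetch a page again,
has_changed to decide whether to re-chunk it, and new_chunks_only to avoid
violating the unique constraint on Document.hash.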
================================================================================
END OF DOCUMENTATION
================================================================================