| # VGEC RAG Chatbot: Codebase Documentation | |
| > **Generated:** 2026-03-25 | |
| > **Version:** 1.0.0 | |
| > **Scope:** Full system: ingestion, retrieval, classification, API, evaluation | |
| --- | |
| ## Table of Contents | |
| 1. [Project Overview](#1-project-overview) | |
| 2. [System Architecture](#2-system-architecture) | |
| 3. [Schema & Data Model](#3-schema--data-model) | |
| 4. [Retrieval Pipeline](#4-retrieval-pipeline) | |
| 5. [Key Classes & Modules](#5-key-classes--modules) | |
| 6. [Evaluation & Metrics](#6-evaluation--metrics) | |
| 7. [Known Limitations](#7-known-limitations) | |
| 8. [File Structure](#8-file-structure) | |
| --- | |
| ## 1. Project Overview | |
| ### Purpose | |
| **VGEC RAG Chatbot** is a Retrieval-Augmented Generation (RAG) chatbot for **Vishwakarma Government Engineering College (VGEC), Chandkheda, Gujarat**. It lets students, faculty, and visitors query structured information about the institution (departments, faculty, syllabus, labs, intake capacity, and more) in natural language. | |
| ### Domain | |
| - **Institution:** VGEC (Government Engineering College, Gujarat) | |
| - **Data Coverage:** Department-level information for multiple disciplines (Computer Engineering, Civil, Electrical, IT, ECE, etc.) | |
| - **Topics:** Faculty lists, lab facilities, syllabus details, HOD info, research activities, intake capacity, achievements | |
| ### Tech Stack | |
| | Layer | Technology | | |
| |---|---| | |
| | **API Framework** | FastAPI | | |
| | **Vector Database** | ChromaDB (persistent, local) | | |
| | **Embeddings** | Google `gemini-embedding-001` (via `langchain-google-genai`) | | |
| | **LLM (Cloud)** | Google Gemini `gemini-2.5-flash-lite` | | |
| | **LLM (Local)** | `EXAONE-3.5-2.4B-Instruct-Q4_K_M.gguf` via `llama-cpp-python` | | |
| | **NLP / Preprocessing** | spaCy (`en_core_web_sm`), NLTK (PorterStemmer) | | |
| | **Classifier** | Scikit-learn `LogisticRegression` + `SentenceTransformer` (`MongoDB/mdbr-leaf-mt`) | | |
| | **BM25** | `langchain-community` `BM25Retriever` | | |
| | **Chunking** | LangChain `RecursiveCharacterTextSplitter` | | |
| | **Config** | Pydantic `BaseSettings` (`.env`-backed) | | |
| ### Key Features Implemented | |
| - ✅ Structured JSON ingestion with intent-aware chunking | |
| - ✅ Hybrid retrieval: BM25 + vector search fused via Reciprocal Rank Fusion (RRF) | |
| - ✅ Intent/metadata classification with confidence-gated ChromaDB filters | |
| - ✅ Abbreviation expansion (`CE` → `Computer Engineering`, etc.) | |
| - ✅ Multi-turn conversation history support | |
| - ✅ Dual LLM backend with automatic fallback (Gemini → Local) | |
| - ✅ Full CRUD REST API for vector store management | |
| - ✅ Offline evaluation endpoint (MRR, hit rate, noise rate) | |
| - ✅ Classifier accuracy evaluation endpoint | |
| --- | |
| ## 2. System Architecture | |
| ### Component Diagram | |
| ``` | |
|             ┌──────────────────────────┐ | |
|             │       FastAPI App        │ | |
|             │   /api/v1/rag  /vector   │ | |
|             └────────────┬─────────────┘ | |
|                          │ DI (lru_cache) | |
|             ┌────────────▼─────────────┐ | |
|             │        RAGService        │ | |
|             │   (core orchestrator)    │ | |
|             └────┬───────────────┬─────┘ | |
|                  │               │ | |
|    ┌─────────────▼────┐   ┌──────▼─────────────────┐ | |
|    │ IngestionService │   │ HybridRetrievalService │ | |
|    │   (write path)   │   │      (read path)       │ | |
|    └────────┬─────────┘   └────┬─────────────┬─────┘ | |
|             │                  │             │ | |
|    ┌────────▼─────┐  ┌─────────▼─────┐ ┌─────▼─────────┐ | |
|    │ FileService  │  │ ClassifierSvc │ │  VectorStore  │ | |
|    │ (file +meta) │  │ (clf predict) │ │  (ChromaDB)   │ | |
|    └──────────────┘  └───────────────┘ └───────────────┘ | |
| ``` | |
| ### Data Flow | |
| #### Ingestion Path | |
| ``` | |
| File Upload (PDF/MD/TXT/JSON) | |
| │ | |
| ▼ | |
| FileService.read_file() → type-aware loading (PyMuPDF for PDF) | |
| │ returns: Document + metadata | |
| ▼ | |
| FileService.write_file() → persist copy to data/documents/ | |
| │ | |
| ▼ | |
| IngestionService.handle_*_docs() → route by file extension | |
| │ | |
| ├─ JSON → handle_json_docs() → intent-aware chunks (list / detail / count) | |
| ├─ text → handle_text_docs() → RecursiveCharacterTextSplitter + normalize() | |
| │ | |
| ▼ | |
| VectorStore.add_documents() → embed + upsert into ChromaDB | |
| │ | |
| ▼ | |
| FileService.patch_metadata() → update ingestion record JSON (chunk count, timing, size) | |
| ``` | |
| #### Query Path | |
| ``` | |
| User Question | |
| │ | |
| ▼ | |
| preprocess_query() → normalize() only: tokenize + strip blanks (no stopword removal yet) | |
| │ | |
| ▼ | |
| HybridRetrievalService.retrieve() | |
| │ | |
| ├─ clf.expand_abbreviations() → CE → Computer Engineering | |
| ├─ clf.predict_with_filter() → LogReg predict → Chroma $and/$or filter | |
| ├─ _vector_rank() → ChromaDB similarity_search_with_score (k=15) | |
| ├─ _bm25_rank() → BM25 over the vector candidate pool | |
| ├─ _reciprocal_rank_fusion() → weighted RRF merge | |
| ├─ metadata score boosting → multiply fused scores for confident matches | |
| └─ _apply_title_boost() → per-query-word title match bonus | |
| │ | |
| ▼ | |
| get_references_v2() → filter by threshold, build context string | |
| │ | |
| ▼ | |
| LLM.invoke(prompt) → Gemini or local LlamaCpp | |
| │ | |
| ▼ | |
| Return: { answer, references, context, threshold_used, k_used } | |
| ``` | |
| ### External Dependencies | |
| | Dependency | Role | Provider | | |
| |---|---|---| | |
| | ChromaDB | Persistent vector store | Local disk | | |
| | Google Gemini API | Embeddings + LLM generation | Google Cloud | | |
| | LlamaCpp (GGUF model) | Local LLM fallback | Local CPU | | |
| | Sentence Transformers | Classifier feature extraction | HuggingFace Hub | | |
| | spaCy `en_core_web_sm` | POS tagging / lemmatization | Local | | |
| --- | |
| ## 3. Schema & Data Model | |
| ### Source JSON Format | |
| Source data files (e.g. `computer_eng.json`) follow this schema: | |
| ```json | |
| { | |
| "id": "computer-engineering-department", | |
| "name": "Computer Engineering Department", | |
| "source": "https://www.vgecg.ac.in/department.php?dept=3", | |
| "category": "computer_eng", | |
| "type": "department", | |
| "created_date": "2026-02-19", | |
| "content": { | |
| "<topic_key>": { | |
| "list": ["item 1", "item 2", "..."], | |
| "details": "Paragraph describing the topic." | |
| } | |
| } | |
| } | |
| ``` | |
| **Top-level fields:** | |
| | Field | Type | Description | | |
| |---|---|---| | |
| | `id` | string | Unique document identifier | | |
| | `name` | string | Human-readable institution/department name | | |
| | `source` | string | Authoritative URL | | |
| | `category` | string | Department slug (e.g. `computer_eng`) | | |
| | `type` | string | Document type (e.g. `department`) | | |
| | `created_date` | string (ISO) | Data creation date | | |
| | `content` | object | Topic map; each key = a topic | | |
| ### Chunk Metadata Schema (stored in ChromaDB) | |
| Every vector chunk stored in Chroma carries the following metadata: | |
| | Field | Type | Source | | |
| |---|---|---| | |
| | `id` | string (UUID) | Auto-generated | | |
| | `title` | string | Document name / topic key | | |
| | `source` | string | Source URL | | |
| | `source_file` | string | Filename (e.g. `computer_eng.json`) | | |
| | `type` | string | Taxonomy level 1 (e.g. `department`) | | |
| | `category` | string | Taxonomy level 2 (e.g. `computer_eng`) | | |
| | `topic` | string | Taxonomy level 3 (e.g. `faculty`) | | |
| | `intent` | string | Chunk intent: `list`, `detail`, or `count` | | |
| | `chunk_index` | int | Sequential index within file | | |
| | `created_date` | string (ISO) | Ingestion timestamp | | |
| | `updated_at` | string (ISO) | Last modification timestamp | | |
| | `ext` | string | Source file extension (`json`, `pdf`, `md`, `txt`) | | |
| ### Hierarchical Taxonomy | |
| The classifier's predictions and the ChromaDB filters operate on a three-level taxonomy (`type`, `category`, `topic`), with an `intent` attached to each topic: | |
| ``` | |
| type | |
| └── category | |
|     └── topic | |
|         └── intent (list | detail | count) | |
| ``` | |
| **Example mapping (Computer Engineering):** | |
| ``` | |
| type: "department" | |
| └── category: "computer_eng" | |
|     ├── topic: "faculty" → intent: list | detail | |
|     ├── topic: "lab" → intent: list | detail | |
|     ├── topic: "syllabus" → intent: list | detail | |
|     ├── topic: "hod" → intent: list | detail | |
|     ├── topic: "intake" → intent: list | detail | |
|     ├── topic: "research" → intent: list | detail | |
|     └── topic: "achievements" | |
| ``` | |
| ### Document Chunking Strategy | |
| **JSON documents** use a hand-crafted, intent-aware strategy in `IngestionService.handle_json_docs()`: | |
| | Intent | Chunk Content | Metadata | | |
| |---|---|---| | |
| | `list` | Numbered list: `1. item\n2. item\n...` | `intent=list` | | |
| | `count` | `"Total <topic>: N"` (auto-generated) | `intent=count` | | |
| | `detail` | Raw paragraph text | `intent=detail` | | |
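The table above can be illustrated with a minimal sketch of how one topic entry might be split into the three chunk kinds. `chunk_topic` and the plain-dict chunk shape are illustrative assumptions, not the actual `handle_json_docs()` implementation, which presumably builds LangChain `Document` objects with the full metadata schema.

```python
from typing import Any


def chunk_topic(topic: str, entry: dict[str, Any]) -> list[dict[str, Any]]:
    """Build list / count / detail chunks for one topic entry (hypothetical helper)."""
    chunks: list[dict[str, Any]] = []
    items = entry.get("list") or []
    if items:
        # intent=list: one numbered line per item.
        numbered = "\n".join(f"{i}. {item}" for i, item in enumerate(items, start=1))
        chunks.append({"text": numbered, "metadata": {"topic": topic, "intent": "list"}})
        # intent=count: synthetic chunk derived from the list length.
        chunks.append({"text": f"Total {topic}: {len(items)}",
                       "metadata": {"topic": topic, "intent": "count"}})
    details = entry.get("details")
    if details:
        # intent=detail: the raw paragraph text.
        chunks.append({"text": details, "metadata": {"topic": topic, "intent": "detail"}})
    return chunks


example = chunk_topic(
    "faculty",
    {"list": ["item 1", "item 2"], "details": "Paragraph describing the topic."},
)
for chunk in example:
    print(chunk["metadata"]["intent"], "|", repr(chunk["text"]))
```

Because the count chunk is derived from the list, it has to be regenerated whenever the list changes.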
| **Text/PDF/Markdown documents** use `RecursiveCharacterTextSplitter`: | |
| - Default: `chunk_size=500`, `chunk_overlap=100` | |
| - Separator priority: `\n\n` → `\n` → ` ` → (character) | |
| - Markdown variant respects `---` section delimiters | |
| - Content is passed through `normalize()` (tokenize + strip blanks) before storage | |
| --- | |
| ## 4. Retrieval Pipeline | |
| ### Query Processing Flow | |
| ```python | |
| # Step 1: Normalize input | |
| query = preprocess_query(question) | |
| # → currently normalize() only: tokenize + strip blanks (no POS filter, lemmatization, or stopword removal) | |
| # Step 2: Expand abbreviations | |
| processed_query = clf.expand_abbreviations(query) | |
| # → "CE dept" → "computer engineering department" | |
| # Step 3: Classify intent/metadata | |
| filters = clf.predict_with_filter([processed_query]) | |
| # → {"$and": [{"type": "department"}, {"intent": "list"}, {"$or": [...]}]} | |
| # Step 4: Vector search with optional filter | |
| raw_results = chroma.similarity_search_with_score(query, k=15, filter=filters) | |
| # Fallback: if filtered results empty, retry without filter | |
| # Step 5: BM25 re-rank over vector candidates | |
| bm25_results = BM25Retriever.from_documents(candidate_docs) | |
| # Step 6: RRF fusion (formula, not executable code) | |
| # fused_score(d) = bm25_weight * 1/(rrf_k + rank_bm25) | |
| #                + vector_weight * 1/(rrf_k + rank_vec) | |
| # Step 7: Metadata confidence boosting | |
| if doc.metadata[field] == predicted_val and conf > 0.90: | |
|     result.fused_score *= boost_factor  # 1.10-1.20 | |
| # Step 8: Title word boost | |
| for word in query_words: | |
|     if word in doc.title: | |
|         result.fused_score += title_boost_per_word  # 0.004 | |
| # Step 9: Threshold filter + sort + top-k | |
| results = [r for r in results if r.fused_score >= threshold] | |
| ``` | |
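Step 6 can be illustrated with a small, self-contained sketch of weighted Reciprocal Rank Fusion. The function name `weighted_rrf` and the document IDs are hypothetical; the default weights and `rrf_k` mirror the `hybrid_query` defaults documented in this section. Ranks are 1-based, and a document missing from one list contributes nothing for that list, which is an assumption about how missing ranks are handled.

```python
def weighted_rrf(
    bm25_ranked: list[str],
    vector_ranked: list[str],
    bm25_weight: float = 0.45,
    vector_weight: float = 0.55,
    rrf_k: int = 20,
) -> dict[str, float]:
    """Fuse two ranked lists of document IDs into one score per ID."""
    scores: dict[str, float] = {}
    for weight, ranking in ((bm25_weight, bm25_ranked), (vector_weight, vector_ranked)):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list adds weight / (rrf_k + rank) for the documents it contains.
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (rrf_k + rank)
    return scores


fused = weighted_rrf(["a", "b", "c"], ["b", "a", "d"])
ranking = sorted(fused, key=fused.get, reverse=True)
print(ranking)
```

With weights that sum to 1 and `rrf_k=20`, a raw fused score cannot exceed 1/21 (about 0.048) before any boosting is applied.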
| ### Classifier Thresholds | |
| The `Classifier` uses two separate threshold tables: | |
| **Prediction threshold.** Below this confidence, the field is set to `None` and is not used at all: | |
| | Field | Threshold | | |
| |---|---| | |
| | `type` | 0.40 | | |
| | `category` | 0.40 | | |
| | `topic` | 0.50 | | |
| | `intent` | 0.60 | | |
| **Filter threshold.** At or above this confidence, the field is used in the ChromaDB filter (`type` as a hard `$and` anchor, `category` and `topic` as soft `$or` hints): | |
| | Field | Threshold | | |
| |---|---| | |
| | `type` | 0.65 | | |
| | `category` | 0.65 | | |
| | `topic` | 0.70 | | |
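A minimal sketch of how the two threshold tables could be applied to one prediction. `gate_prediction` and the shape of its return value are hypothetical; the input dict follows the `{type, category, topic, intent, *_conf}` shape described for `predict()`, and treating a confidence exactly equal to a threshold as passing is an assumption.

```python
PREDICTION_THRESHOLDS = {"type": 0.40, "category": 0.40, "topic": 0.50, "intent": 0.60}
FILTER_THRESHOLDS = {"type": 0.65, "category": 0.65, "topic": 0.70}


def gate_prediction(pred: dict) -> dict:
    """Null out low-confidence fields and flag which ones may be used in a filter."""
    gated = {}
    for field, min_conf in PREDICTION_THRESHOLDS.items():
        conf = pred.get(f"{field}_conf", 0.0)
        # Below the prediction threshold the value is dropped entirely.
        value = pred.get(field) if conf >= min_conf else None
        # Only fields listed in the filter table can ever become filter terms.
        filterable = (
            value is not None
            and field in FILTER_THRESHOLDS
            and conf >= FILTER_THRESHOLDS[field]
        )
        gated[field] = {"value": value, "conf": conf, "filterable": filterable}
    return gated


gated = gate_prediction({
    "type": "department", "type_conf": 0.92,
    "category": "computer_eng", "category_conf": 0.55,
    "topic": "faculty", "topic_conf": 0.45,
    "intent": "list", "intent_conf": 0.80,
})
for field, info in gated.items():
    print(field, info)
```

In this example `category` survives as a prediction but is too uncertain to filter on, while `topic` is discarded altogether.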
| ### Filter Construction Logic (`_build_filter`) | |
| ```python | |
| # Gate: if type confidence < 0.65 → return None (full scan) | |
| # Hard anchors (always included if type passes): | |
| #   - type == predicted_type | |
| #   - intent == predicted_intent (special: "count" expands to count OR detail) | |
| # Soft hints (combined as $or): | |
| #   - category == predicted_category (if conf >= 0.65, else "general") | |
| #   - topic == predicted_topic (if conf >= 0.70, else "general") | |
| ``` | |
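The rules above can be sketched as a standalone function that returns a Chroma-style `where` dict. This is an illustration of the documented logic, not the project's `_build_filter`; the function name `build_filter`, the input dict shape, and the choice to skip the intent anchor when no intent was predicted are assumptions.

```python
from typing import Any, Optional


def build_filter(pred: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Build a metadata filter from one classifier prediction (illustrative only)."""
    # Gate: without a confident type, fall back to an unfiltered search.
    if pred.get("type") is None or pred.get("type_conf", 0.0) < 0.65:
        return None

    clauses: list[dict[str, Any]] = [{"type": pred["type"]}]

    intent = pred.get("intent")
    if intent == "count":
        # "count" expands to count OR detail.
        clauses.append({"$or": [{"intent": "count"}, {"intent": "detail"}]})
    elif intent is not None:
        clauses.append({"intent": intent})

    # Soft hints: use the prediction when confident, otherwise the "general" bucket.
    category = pred.get("category") if pred.get("category_conf", 0.0) >= 0.65 else None
    topic = pred.get("topic") if pred.get("topic_conf", 0.0) >= 0.70 else None
    clauses.append({"$or": [{"category": category or "general"},
                            {"topic": topic or "general"}]})
    return {"$and": clauses}


where = build_filter({
    "type": "department", "type_conf": 0.90,
    "category": "computer_eng", "category_conf": 0.70,
    "topic": "faculty", "topic_conf": 0.50,
    "intent": "count", "intent_conf": 0.80,
})
print(where)
```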
| ### Hybrid Retrieval Config (Defaults) | |
| | Parameter | `hybrid_query` | `search_docs` | | |
| |---|---|---| | |
| | `candidate_k` | 15 | 15 | | |
| | `top_k` (final) | `settings.similarity_top_k` (8) | k (param) | | |
| | `bm25_weight` | 0.45 | 0.70 | | |
| | `vector_weight` | 0.55 | 0.30 | | |
| | `rrf_k` | 20 | 20 | | |
| | `bm25_k1` | 1.2 | 1.5 | | |
| | `bm25_b` | 0.9 | 0.75 | | |
| | `title_boost_per_word` | 0.004 | 0.004 | | |
| | `score_threshold` | 0.4 | 0.4 | | |
| > **Note:** `search_docs` is BM25-heavy (0.70) since it is used for keyword-oriented document browsing, while `hybrid_query` is vector-heavy for semantic QA. | |
| --- | |
| ## 5. Key Classes & Modules | |
| ### Services (`app/services/`) | |
| #### `RAGService` | |
| Main orchestrator. Singleton via `lru_cache` in `dependencies.py`. | |
| | Method | Description | | |
| |---|---| | |
| | `query()` | Semantic-only QA (vector search → LLM) | |
| | `hybrid_query()` | Hybrid QA (BM25 + vector → RRF → LLM) | |
| | `search_docs()` | BM25-heavy document search, no LLM | | |
| | `ingest_documents()` | Ingest a file path into the vector store | | |
| | `get_filenames()` | Return all tracked file metadata records | | |
| | `test_queries()` | Batch retrieval evaluation (MRR, precision, noise) | | |
| | `test_classifier()` | Batch classifier accuracy evaluation | | |
| | `delete_database()` | Drop the entire ChromaDB collection | | |
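The `lru_cache` singleton pattern mentioned above can be sketched as follows. The stand-in `RAGService` class and the provider name `get_rag_service` are assumptions for illustration; in a FastAPI app a function like this would typically be passed to `Depends(...)`.

```python
from functools import lru_cache


class RAGService:
    """Stand-in for the real orchestrator, which is assumed to be expensive to build."""

    def __init__(self) -> None:
        self.evaluation: dict = {}


@lru_cache(maxsize=1)
def get_rag_service() -> RAGService:
    """Build the service on first call and return the same instance afterwards."""
    return RAGService()


first = get_rag_service()
second = get_rag_service()
print(first is second)
```

The cache lives inside one Python process, so each worker process ends up with its own instance.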
| #### `HybridRetrievalService` | |
| Stateless per-request service created inline by `RAGService`. | |
| | Method | Description | | |
| |---|---| | |
| | `retrieve(query)` | Full hybrid retrieval pipeline; returns `List[RetrievalResult]` | | |
| | `_vector_rank()` | Chroma similarity search + classifier filter | | |
| | `_bm25_rank()` | BM25 over candidate pool | | |
| | `_reciprocal_rank_fusion()` | Merge both ranked lists via RRF | | |
| | `_apply_title_boost()` | Word-level title match score bonus | | |
| **`RetrievalResult` dataclass:** | |
| ```python | |
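| from dataclasses import dataclass | |
| from typing import Optional | |
| # Document is assumed to be the LangChain document type used across the services | |
| @dataclass | |
| class RetrievalResult: | |
|     document: Document | |
|     fused_score: float | |
|     bm25_rank: Optional[int] | |
|     vector_rank: Optional[int] | |
|     title_boost: float | |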
| ``` | |
| #### `Classifier` | |
| Loaded at startup from a pickled pipeline (`chatbot_classifier.pkl`). | |
| | Method | Description | | |
| |---|---| | |
| | `predict(queries)` | Returns list of `{type, category, topic, intent, *_conf}` dicts | | |
| | `predict_with_filter(queries)` | Returns a ChromaDB-compatible filter dict or `None` | | |
| | `expand_abbreviations(text)` | Regex-based abbreviation expansion | | |
| | `get_features(queries)` | Build `[SentenceTransformer embedding | TF-IDF]` feature matrix | | |
| | `train_models(df)` | Train 4 LogisticRegression classifiers (offline use) | | |
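A minimal sketch of regex-based abbreviation expansion, in the spirit of `expand_abbreviations(text)`. The mapping below is hypothetical (the project keeps its own table, listed as `short_words_mappings` in `constants.py`), and matching whole words case-insensitively is an assumption.

```python
import re

# Hypothetical mapping used only for this example.
ABBREVIATIONS = {
    "ce": "computer engineering",
    "dept": "department",
    "hod": "head of department",
}

# One alternation with word boundaries, so "ce" does not match inside "ceiling".
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b",
    re.IGNORECASE,
)


def expand_abbreviations(text: str) -> str:
    """Replace whole-word abbreviations with their long forms."""
    return _PATTERN.sub(lambda match: ABBREVIATIONS[match.group(1).lower()], text)


print(expand_abbreviations("CE dept"))
```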
| #### `IngestionService` | |
| | Method | Description | | |
| |---|---| | |
| | `ingest(file_path)` | Load + chunk a file; returns `List[Document]` | | |
| | `handle_json_docs()` | Intent-aware chunking for structured JSON data | | |
| | `handle_text_docs()` | Recursive character splitting for unstructured text | | |
| | `get_records()` | Delegate to `FileService.get_records()` | | |
| | `delete_record(filename)` | Remove a file's metadata record | | |
| | `path_record(path, metadata)` | Patch ingestion stats after indexing | | |
| #### `FileService` | |
| | Method | Description | | |
| |---|---| | |
| | `read_file(path)` | Load file content; dispatches by extension | | |
| | `write_file(path, content, metadata)` | Persist file to `data/documents/` | | |
| | `patch_metadata(path, metadata)` | Merge new fields into existing record | | |
| | `get_records()` | Return all ingestion records dict | | |
| | `delete_record(filename)` | Remove a record from `<collection>.json` | | |
| #### `VectorStore` | |
| Thin wrapper around `langchain_chroma.Chroma`. | |
| | Method | Description | | |
| |---|---| | |
| | `get()` | Retrieve all documents | | |
| | `get_by_id(ids)` | Retrieve specific documents by ID | | |
| | `add_documents(docs)` | Embed + insert, skipping empty chunks | | |
| | `update_document(id, doc)` | Delete then re-insert with same ID | | |
| | `delete(ids)` | Remove documents by ID list | | |
| | `similarity_search_with_score()` | Wrapped Chroma search | | |
| ### Utilities (`app/utils/`) | |
| #### `preprocessing.py` | |
| | Function | Description | | |
| |---|---| | |
| | `preprocess(text)` | spaCy POS filter + lemmatize + stopword removal → joined string | |
| | `normalize(text)` | Tokenize + strip blanks (lightweight, no POS) | | |
| | `preprocess_query(query)` | Applies `normalize()` to user queries | | |
| | `preprocess_documents(docs)` | Applies `preprocess()` to a document list in-place | | |
| | `preprocess_filename(path)` | Sanitize filename (remove special chars, lowercase) | | |
| #### `document_helpers.py` | |
| | Function | Description | | |
| |---|---| | |
| | `get_references_v2(docs, threshold)` | Convert `RetrievalResult` list → references dict + context string | |
| | `get_references(docs, threshold)` | Same for raw `(Document, distance)` tuples (used by `query()`) | | |
| | `build_metadata(path)` | Parse YAML frontmatter from `.md`/`.txt` files | | |
| | `create_documents(chunks, ...)` | Attach standard metadata (UUID, timestamps, indices) to chunks | | |
| | `create_documents_from_text(text)` | Full pipeline: frontmatter parse → split → metadata attach | |
| | `clean_metadata(metadata)` | Serialize datetime, coerce non-allowed types to string | | |
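A sketch of what a helper like `clean_metadata` might do. The allowed scalar types (`str`, `int`, `float`, `bool`) are an assumption based on what vector-store metadata commonly accepts, and this is not the project's actual implementation.

```python
from datetime import date, datetime
from typing import Any

# Assumed set of value types that can be stored as chunk metadata.
ALLOWED_TYPES = (str, int, float, bool)


def clean_metadata(metadata: dict[str, Any]) -> dict[str, Any]:
    """Serialize dates to ISO strings and stringify anything else that is not allowed."""
    cleaned: dict[str, Any] = {}
    for key, value in metadata.items():
        if isinstance(value, (datetime, date)):
            cleaned[key] = value.isoformat()
        elif isinstance(value, ALLOWED_TYPES):
            cleaned[key] = value
        else:
            cleaned[key] = str(value)
    return cleaned


sample = clean_metadata({
    "created_date": datetime(2026, 2, 19),
    "chunk_index": 3,
    "tags": ["a", "b"],
})
print(sample)
```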
| #### `model_factory.py` | |
| | Function | Description | | |
| |---|---| | |
| | `get_embedding_model()` | Returns `GoogleGenerativeAIEmbeddings` | | |
| | `get_gemini_model()` | Returns `ChatGoogleGenerativeAI` | | |
| | `get_local_model()` | Returns `ChatLlamaCpp` (GGUF, CPU inference) | | |
| | `get_llm_model(provider)` | Dispatches to Gemini or Local with fallback logic | | |
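The fallback dispatch can be sketched without any model libraries. `select_llm`, its `factories` argument, and the stub factories are hypothetical; the real `get_llm_model(provider)` takes only the provider name and reads `enable_fallback` from settings.

```python
from typing import Callable


def select_llm(
    provider: str,
    factories: dict[str, Callable[[], object]],
    enable_fallback: bool = True,
) -> tuple[str, object]:
    """Try the requested backend first, then the remaining ones if fallback is on."""
    order = [provider] + [name for name in factories if name != provider]
    if not enable_fallback:
        order = order[:1]
    last_error = None
    for name in order:
        try:
            return name, factories[name]()
        except Exception as exc:  # broad on purpose: any load failure triggers fallback
            last_error = exc
    raise RuntimeError(f"no LLM backend available, tried {order}") from last_error


def broken_gemini() -> object:
    raise ConnectionError("Gemini API unreachable")  # simulated failure


def local_model() -> object:
    return "local-llm-stub"


backends = {"gemini": broken_gemini, "local": local_model}
used, model = select_llm("gemini", backends)
print(used)
```

Returning the name of the backend that actually loaded makes it easy to log when a fallback happened.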
| ### API Routes (`app/api/routes/`) | |
| #### `rag.py`: prefix `/api/v1/rag` | |
| | Method | Endpoint | Description | | |
| |---|---|---| | |
| | GET | `/` | Health check | | |
| | POST | `/` | Semantic query | | |
| | POST | `/hybrid_query` | Hybrid RAG query (primary endpoint) | | |
| | POST | `/similarity_search` | Hybrid retrieval, no LLM response | | |
| | POST | `/search` | BM25-heavy document search | | |
| | POST | `/test` | Batch retrieval evaluation | | |
| | POST | `/test_classifier` | Classifier accuracy evaluation | | |
| | GET | `/test_classifier_dataset` | Run built-in test dataset, cache result | | |
| #### `vector_store.py`: prefix `/api/v1/vector` | |
| | Method | Endpoint | Description | | |
| |---|---|---| | |
| | GET | `/` | List all documents (paginated, filterable) | | |
| | GET | `/filenames` | List ingested file records | | |
| | GET | `/{id}` | Get single document by ChromaDB ID | | |
| | POST | `/` | Upload + ingest file | | |
| | PUT | `/{id}` | Update document content/metadata | | |
| | DELETE | `/ids` | Bulk delete by ID list | | |
| | DELETE | `/{id}` | Delete single document | | |
| | DELETE | `/` | Filter-based delete (filename/source/contains) | | |
| ### Configuration (`app/core/config.py`) | |
| All settings are read from `.env` via Pydantic `BaseSettings`: | |
| ```python | |
| from typing import Literal | |
| # BaseSettings comes from pydantic_settings (Pydantic v2) or pydantic (v1), depending on the installed version | |
| class Settings(BaseSettings): | |
|     # Paths | |
|     collection_name: str = "classifier_test_1" | |
|     persist_directory: str = "./data/vector_stores/classifier_test_1" | |
|     # Chunking | |
|     chunk_size: int = 500 | |
|     chunk_overlap: int = 100 | |
|     # Retrieval | |
|     similarity_top_k: int = 8 | |
|     similarity_threshold: float = 0.4 | |
|     # LLM Provider | |
|     llm_provider: Literal["gemini", "local"] = "local" | |
|     enable_fallback: bool = True | |
|     # Models | |
|     embedding_model_name: str = "models/gemini-embedding-001" | |
|     gemini_model_name: str = "gemini-2.5-flash-lite" | |
|     local_model_name: str = "EXAONE-3.5-2.4B-Instruct-Q4_K_M.gguf" | |
|     # Generation | |
|     max_output_tokens: int = 2048 | |
|     local_max_tokens: int = 512 | |
|     # Auth | |
|     google_api_key: str  # required: must be set in .env | |
| ``` | |
| --- | |
| ## 6. Evaluation & Metrics | |
| ### Retrieval Evaluation (`test_queries` / `POST /api/v1/rag/test`) | |
| Tests each (question, expected_document, expected_chunk_index) triple against `hybrid_query`: | |
| | Metric | Formula | Interpretation | | |
| |---|---|---| | |
| | **Hit Rate** | `hits / total` | % of questions where the exact chunk was retrieved | | |
| | **Top-1 Hit Rate** | `rank==1 hits / total` | % of questions where exact chunk was top result | | |
| | **MRR** | `mean(1/rank)` | Mean Reciprocal Rank; higher = correct result ranked earlier | | |
| | **Doc Precision** | `correct_source_chunks / all_chunks` | How many retrieved chunks came from the right document | | |
| | **Doc Recall** | `1 if any correct_source_chunk else 0` | Did we retrieve at least one chunk from the right document? | | |
| | **Doc Noise** | `wrong_source_chunks / all_chunks` | Proportion of off-topic chunks in the result set | | |
| | **Error Rate** | `1 - hit_rate` | Miss rate for exact chunk retrieval | | |
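The formulas in the table can be sketched as one aggregation function. The per-question record shape (`rank`, `sources`, `expected_source`) and the choice to average the document-level metrics across questions are assumptions; `rank` is the 1-based position of the exact expected chunk, or `None` on a miss, and misses contribute 0 to MRR.

```python
def evaluate(results: list[dict]) -> dict[str, float]:
    """Aggregate retrieval metrics over a batch of per-question results."""
    total = len(results)
    ranks = [r["rank"] for r in results if r["rank"] is not None]
    precisions, recalls, noises = [], [], []
    for r in results:
        sources = r["sources"]
        correct = sum(1 for source in sources if source == r["expected_source"])
        precisions.append(correct / len(sources) if sources else 0.0)
        noises.append((len(sources) - correct) / len(sources) if sources else 0.0)
        recalls.append(1.0 if correct else 0.0)
    hit_rate = len(ranks) / total
    return {
        "hit_rate": hit_rate,
        "top1_hit_rate": sum(1 for rank in ranks if rank == 1) / total,
        "mrr": sum(1 / rank for rank in ranks) / total,
        "doc_precision": sum(precisions) / total,
        "doc_recall": sum(recalls) / total,
        "doc_noise": sum(noises) / total,
        "error_rate": 1 - hit_rate,
    }


metrics = evaluate([
    {"rank": 1, "sources": ["a.json", "a.json"], "expected_source": "a.json"},
    {"rank": 2, "sources": ["b.json", "a.json"], "expected_source": "a.json"},
    {"rank": None, "sources": ["b.json", "b.json"], "expected_source": "a.json"},
    {"rank": None, "sources": ["a.json", "b.json"], "expected_source": "a.json"},
])
print(metrics)
```

The last record shows why Hit Rate and Doc Recall differ: the right document was retrieved, but not the exact chunk.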
| **Test Input Schema:** | |
| ```python | |
| class TestRequestSchema(BaseModel): | |
|     tests: List[Test]  # question + document + chunk_index | |
|     k: int = 5 | |
|     threshold: float = 0.4 | |
| ``` | |
| ### Classifier Evaluation (`test_classifier` / `POST /api/v1/rag/test_classifier`) | |
| Evaluates predictions for all 4 classification fields (`type`, `category`, `topic`, `intent`): | |
| | Metric | Notes | | |
| |---|---| | |
| | **Accuracy** | `sklearn.accuracy_score` | | |
| | **Precision (macro)** | `zero_division=0` | | |
| | **Recall (macro)** | `zero_division=0` | | |
| | **F1 Macro** | Unweighted average across classes | | |
| | **F1 Weighted** | Class-frequency weighted | | |
| | **Classification Report** | Full per-class breakdown (`output_dict=True`) | | |
| A bundled test dataset is stored in `app/utils/tests.py` as `classifier_test_dataset` and can be executed via `GET /api/v1/rag/test_classifier_dataset`. Results are **memoized** on the `RAGService.evaluation` dict for the lifetime of the server process. | |
| --- | |
| ## 7. Known Limitations | |
| ### Technical Debt | |
| - **`preprocess_query` is incomplete.** The function body contains an LLM-powered query-rewriting block that is commented out. It currently just calls `normalize()` (tokenize only), so no stopword removal or lemmatization is applied to user queries (only to stored documents). | |
| - **`search_docs` does not pass `filename` to Chroma as a metadata filter.** The filter is applied in Python after retrieval, which is inefficient for large collections. | |
| - **Count intent is synthetic.** The `"Total <topic>: N"` chunk is generated during ingestion rather than taken from the source document. If the source data changes, stale count chunks can remain indexed. | |
| - **`VectorStore.get_dict()` has a `print(type(rows))`** debug statement left in production code. | |
| - **`FileService.__init__` docstring** contains a stray backtick. | |
| ### Planned but Unimplemented | |
| - **Query rewriting via local LLM:** skeleton is commented out in `preprocess_query()`. | |
| - **Semantic caching:** no query result memoization at the API layer. | |
| - **Re-ranker:** no cross-encoder re-ranking step; relies only on RRF + boosting. | |
| - **`topic` field is not included in the ChromaDB hard filter:** only `type` + `intent` are hard-anchored; `category` and `topic` are soft `$or` hints. | |
| ### Performance Bottlenecks | |
| - **Local LLM (LlamaCpp)** is CPU-only with `n_ctx=8096` and `n_threads=4`. Response latency is high (~10-30 s) on low-RAM systems. | |
| - **Classifier uses `SentenceTransformer` + `TF-IDF` features:** inference runs on every request with no caching of query embeddings. | |
| - **BM25 corpus is rebuilt from scratch per request:** `BM25Retriever.from_documents()` is called inside `_bm25_rank()` each time. | |
| - **`classifier_test_dataset` in `app/utils/tests.py`** lives in a very large module (1.8 MB) that is loaded at import time. | |
| - **The memoized evaluation** in `rag_service.evaluation` is not thread-safe, and when the server runs with multiple workers each process keeps its own copy. | |
| --- | |
| ## 8. File Structure | |
| ``` | |
| VGEC-RAG-Chatbot/ | |
| │ | |
| ├── app/  # Application package | |
| │   ├── main.py  # FastAPI app, router mounting, CORS middleware | |
| │   ├── core/ | |
| │   │   ├── config.py  # Pydantic Settings (all tuneable params) | |
| │   │   └── paths.py  # Path constants helper | |
| │   │ | |
| │   ├── api/ | |
| │   │   ├── dependencies.py  # lru_cache singleton for RAGService | |
| │   │   ├── routes/ | |
| │   │   │   ├── rag.py  # /rag endpoints (query, test, classifier) | |
| │   │   │   ├── vector_store.py  # /vector endpoints (CRUD for ChromaDB) | |
| │   │   │   └── settings.py  # /settings endpoints | |
| │   │   └── schemas/ | |
| │   │       ├── requests.py  # RAGRequest, PaginationParams, etc. | |
| │   │       └── tests.py  # TestRequestSchema, TestClassifierReqSchema | |
| │   │ | |
| │   ├── services/ | |
| │   │   ├── rag_service.py  # RAGService (main orchestrator) | |
| │   │   ├── hybrid_retrieval.py  # HybridRetrievalService + RRF logic | |
| │   │   ├── classifier_service.py  # Classifier class + singleton clf | |
| │   │   ├── ingestion_service.py  # IngestionService (chunking pipeline) | |
| │   │   ├── file_service.py  # FileService (file I/O + metadata JSON) | |
| │   │   ├── vector_store.py  # VectorStore (thin ChromaDB wrapper) | |
| │   │   ├── text_splitter.py  # TextSplitter (RecursiveCharacter + variants) | |
| │   │   └── document_loader.py  # (legacy loader, not in primary path) | |
| │   │ | |
| │   ├── utils/ | |
| │   │   ├── preprocessing.py  # preprocess(), normalize(), preprocess_query() | |
| │   │   ├── document_helpers.py  # get_references_v2(), build_metadata(), create_documents() | |
| │   │   ├── model_factory.py  # get_llm_model(), get_embedding_model() | |
| │   │   ├── constants.py  # stopwords list, short_words_mappings | |
| │   │   ├── embeddings.py  # (thin embedding util) | |
| │   │   ├── llm_models.py  # (thin LLM util) | |
| │   │   └── tests.py  # classifier_test_dataset (large, 1.8MB) | |
| │   │ | |
| │   └── prompts/ | |
| │       └── __init__.py  # SYSTEM_PROMPT, wrap_exaone() | |
| │ | |
| ├── ml_models/ | |
| │   ├── classifier/ | |
| │   │   └── chatbot_classifier.pkl  # Pickled pipeline (models, tfidf, label encoders, etc.) | |
| │   ├── embeddings/  # (Local embedding model weights, if any) | |
| │   └── llm/ | |
| │       └── EXAONE-3.5-2.4B-*.gguf  # Local LLM weights | |
| │ | |
| ├── data/ | |
| │   ├── department_data/  # Source JSON files per department | |
| │   │   ├── computer_eng.json | |
| │   │   ├── civil.json | |
| │   │   └── ... | |
| │   ├── documents/  # Persistent copies of ingested files | |
| │   ├── vector_stores/ | |
| │   │   └── classifier_test_1/  # ChromaDB persist directory | |
| │   ├── classifier_test_1.json  # Ingestion metadata registry (FileService records) | |
| │   └── other_data/  # Misc data files | |
| │ | |
| ├── temp/  # Staging area for uploaded files (auto-cleared) | |
| ├── scripts/  # Offline scripts (training, testing) | |
| ├── tests/  # Test files | |
| │ | |
| ├── requirements.txt  # Pinned production dependencies | |
| ├── .env  # Runtime secrets (google_api_key, etc.) | |
| ├── .env.example  # Template for .env | |
| └── CODEBASE_DOCUMENTATION.md  # This file | |
| ``` | |
| --- | |
| *End of documentation.* | |