# VGEC RAG Chatbot — Codebase Documentation

> **Generated:** 2026-03-25  
> **Version:** 1.0.0  
> **Scope:** Full system — ingestion, retrieval, classification, API, evaluation

---

## Table of Contents

1. [Project Overview](#1-project-overview)
2. [System Architecture](#2-system-architecture)
3. [Schema & Data Model](#3-schema--data-model)
4. [Retrieval Pipeline](#4-retrieval-pipeline)
5. [Key Classes & Modules](#5-key-classes--modules)
6. [Evaluation & Metrics](#6-evaluation--metrics)
7. [Known Limitations](#7-known-limitations)
8. [File Structure](#8-file-structure)

---

## 1. Project Overview

### Purpose

**VGEC RAG Chatbot** is a Retrieval-Augmented Generation (RAG) chatbot for **Vishwakarma Government Engineering College (VGEC), Chandkheda, Gujarat**. It allows students, faculty, and visitors to query structured information about the institution — departments, faculty, syllabus, labs, intake capacity, and more — through natural language.

### Domain

- **Institution:** VGEC (Government Engineering College, Gujarat)
- **Data Coverage:** Department-level information for multiple disciplines (Computer Engineering, Civil, Electrical, IT, ECE, etc.)
- **Topics:** Faculty lists, lab facilities, syllabus details, HOD info, research activities, intake capacity, achievements

### Tech Stack

| Layer | Technology |
|---|---|
| **API Framework** | FastAPI |
| **Vector Database** | ChromaDB (persistent, local) |
| **Embeddings** | Google `gemini-embedding-001` (via `langchain-google-genai`) |
| **LLM (Cloud)** | Google Gemini `gemini-2.5-flash-lite` |
| **LLM (Local)** | `EXAONE-3.5-2.4B-Instruct-Q4_K_M.gguf` via `llama-cpp-python` |
| **NLP / Preprocessing** | spaCy (`en_core_web_sm`), NLTK (PorterStemmer) |
| **Classifier** | Scikit-learn `LogisticRegression` + `SentenceTransformer` (`MongoDB/mdbr-leaf-mt`) |
| **BM25** | `langchain-community` `BM25Retriever` |
| **Chunking** | LangChain `RecursiveCharacterTextSplitter` |
| **Config** | Pydantic `BaseSettings` (`.env`-backed) |

### Key Features Implemented

- ✅ Structured JSON ingestion with intent-aware chunking
- ✅ Hybrid retrieval: BM25 + vector search fused via Reciprocal Rank Fusion (RRF)
- ✅ Intent/metadata classification with confidence-gated ChromaDB filters
- ✅ Abbreviation expansion (`CE` → `Computer Engineering`, etc.)
- ✅ Multi-turn conversation history support
- ✅ Dual LLM backend with automatic fallback (Gemini ↔ Local)
- ✅ Full CRUD REST API for vector store management
- ✅ Offline evaluation endpoint (MRR, hit rate, noise rate)
- ✅ Classifier accuracy evaluation endpoint

---

## 2. System Architecture

### Component Diagram

```
    ┌────────────────────────────────────────┐
    │              FastAPI App               │
    │         /api/v1/rag   /vector          │
    └────────────────────┬───────────────────┘
                         │ DI (lru_cache)
    ┌────────────────────▼───────────────────┐
    │               RAGService               │
    │          (core orchestrator)           │
    └────┬──────────────────────────────┬────┘
         │                              │
┌────────▼─────────┐    ┌───────────────▼──────────────┐
│ IngestionService │    │    HybridRetrievalService    │
│   (write path)   │    │         (read path)          │
└───────┬──────────┘    └────┬────────────────────┬────┘
        │                    │                    │
┌───────▼────────┐   ┌───────▼────────┐   ┌───────▼────────┐
│  FileService   │   │ ClassifierSvc  │   │  VectorStore   │
│ (file + meta)  │   │ (clf predict)  │   │   (ChromaDB)   │
└────────────────┘   └────────────────┘   └────────────────┘
```

### Data Flow

#### Ingestion Path

```
File Upload (PDF/MD/TXT/JSON)
   │
   ▼
FileService.read_file()          ← type-aware loading (PyMuPDF for PDF)
   │  returns: Document + metadata
   ▼
FileService.write_file()         ← persist copy to data/documents/
   │
   ▼
IngestionService.handle_*_docs() ← route by file extension
   │
   ├─ JSON → handle_json_docs()  ← intent-aware chunks (list / detail / count)
   └─ text → handle_text_docs()  ← RecursiveCharacterTextSplitter + normalize()
   │
   ▼
VectorStore.add_documents()      ← embed + upsert into ChromaDB
   │
   ▼
FileService.patch_metadata()     ← update ingestion record JSON (chunk count, timing, size)
```

#### Query Path

```
User Question
   │
   ▼
preprocess_query()               ← normalize() only: tokenize + strip blanks
   │
   ▼
HybridRetrievalService.retrieve()
   │
   ├─ clf.expand_abbreviations() ← CE → Computer Engineering
   ├─ clf.predict_with_filter()  ← LogReg predict → Chroma $and/$or filter
   ├─ _vector_rank()             ← ChromaDB similarity_search_with_score (k=15)
   ├─ _bm25_rank()               ← BM25 over the vector candidate pool
   ├─ _reciprocal_rank_fusion()  ← weighted RRF merge
   ├─ metadata score boosting    ← multiply fused scores for confident matches
   └─ _apply_title_boost()       ← per-query-word title match bonus
   │
   ▼
get_references_v2()              ← filter by threshold, build context string
   │
   ▼
LLM.invoke(prompt)               ← Gemini or local LlamaCpp
   │
   ▼
Return: { answer, references, context, threshold_used, k_used }
```

### External Dependencies

| Dependency | Role | Provider |
|---|---|---|
| ChromaDB | Persistent vector store | Local disk |
| Google Gemini API | Embeddings + LLM generation | Google Cloud |
| LlamaCpp (GGUF model) | Local LLM fallback | Local CPU |
| Sentence Transformers | Classifier feature extraction | HuggingFace Hub |
| spaCy `en_core_web_sm` | POS tagging / lemmatization | Local |

---

## 3. Schema & Data Model

### Source JSON Format

Source data files (e.g. `computer_eng.json`) follow this schema:

```json
{
  "id": "computer-engineering-department",
  "name": "Computer Engineering Department",
  "source": "https://www.vgecg.ac.in/department.php?dept=3",
  "category": "computer_eng",
  "type": "department",
  "created_date": "2026-02-19",
  "content": {
    "<topic_key>": {
      "list": ["item 1", "item 2", "..."],
      "details": "Paragraph describing the topic."
    }
  }
}
```

**Top-level fields:**

| Field | Type | Description |
|---|---|---|
| `id` | string | Unique document identifier |
| `name` | string | Human-readable institution/department name |
| `source` | string | Authoritative URL |
| `category` | string | Department slug (e.g. `computer_eng`) |
| `type` | string | Document type (e.g. `department`) |
| `created_date` | string (ISO) | Data creation date |
| `content` | object | Topic map; each key = a topic |
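
A minimal sketch of how a source file could be checked against this schema before ingestion. The `validate_source` helper and its error messages are hypothetical and not part of the codebase; only the field names come from the table above. Allowing a topic to carry just one of `list` / `details` is also an assumption.

```python
import json

# Required top-level fields and their expected JSON types (from the table above).
REQUIRED_FIELDS = {
    "id": str,
    "name": str,
    "source": str,
    "category": str,
    "type": str,
    "created_date": str,
    "content": dict,
}


def validate_source(raw: str) -> dict:
    """Parse a source JSON string and verify the top-level schema (hypothetical helper)."""
    data = json.loads(raw)
    for field, expected in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected):
            raise ValueError(f"{field} must be {expected.__name__}")
    # Assumption: each topic carries a "list", a "details" paragraph, or both.
    for topic, body in data["content"].items():
        if not isinstance(body, dict) or not (body.keys() & {"list", "details"}):
            raise ValueError(f"topic {topic!r} needs 'list' and/or 'details'")
    return data
```

Running a check like this before `handle_json_docs()` would surface malformed files early instead of producing partially chunked documents.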

### Chunk Metadata Schema (stored in ChromaDB)

Every vector chunk stored in Chroma carries the following metadata:

| Field | Type | Source |
|---|---|---|
| `id` | string (UUID) | Auto-generated |
| `title` | string | Document name / topic key |
| `source` | string | Source URL |
| `source_file` | string | Filename (e.g. `computer_eng.json`) |
| `type` | string | Taxonomy level 1 (e.g. `department`) |
| `category` | string | Taxonomy level 2 (e.g. `computer_eng`) |
| `topic` | string | Taxonomy level 3 (e.g. `faculty`) |
| `intent` | string | Chunk intent: `list`, `detail`, or `count` |
| `chunk_index` | int | Sequential index within file |
| `created_date` | string (ISO) | Ingestion timestamp |
| `updated_at` | string (ISO) | Last modification timestamp |
| `ext` | string | Source file extension (`json`, `pdf`, `md`, `txt`) |

### Hierarchical Taxonomy

The classifier's predictions and the ChromaDB filters operate on a three-level hierarchy (`type` → `category` → `topic`), with an `intent` attached to each chunk:

```
type
 └── category
      └── topic
           └── intent  (list | detail | count)
```

**Example mapping (Computer Engineering):**

```
type: "department"
  └── category: "computer_eng"
         ├── topic: "faculty"    → intent: list | detail
         ├── topic: "lab"        → intent: list | detail
         ├── topic: "syllabus"   → intent: list | detail
         ├── topic: "hod"        → intent: list | detail
         ├── topic: "intake"     → intent: list | detail
         ├── topic: "research"   → intent: list | detail
         └── topic: "achievements"
```

### Document Chunking Strategy

**JSON documents** use a hand-crafted, intent-aware strategy in `IngestionService.handle_json_docs()`:

| Intent | Chunk Content | Metadata |
|---|---|---|
| `list` | Numbered list: `1. item\n2. item\n...` | `intent=list` |
| `count` | `"Total <topic>: N"` (auto-generated) | `intent=count` |
| `detail` | Raw paragraph text | `intent=detail` |
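
The table above can be sketched as a small function. This is an illustration only: `chunk_topic` and the plain-dict chunk shape are hypothetical, and the real `handle_json_docs()` attaches the full metadata schema described earlier.

```python
from typing import Any


def chunk_topic(topic: str, body: dict[str, Any]) -> list[dict[str, Any]]:
    """Build list / count / detail chunks for one topic (illustrative sketch)."""
    chunks = []
    items = body.get("list") or []
    if items:
        numbered = "\n".join(f"{i}. {item}" for i, item in enumerate(items, start=1))
        chunks.append({"text": numbered, "metadata": {"topic": topic, "intent": "list"}})
        # The count chunk is synthetic: derived from the list length, not the source text.
        chunks.append({
            "text": f"Total {topic}: {len(items)}",
            "metadata": {"topic": topic, "intent": "count"},
        })
    details = body.get("details")
    if details:
        chunks.append({"text": details, "metadata": {"topic": topic, "intent": "detail"}})
    return chunks
```

Because the count chunk is derived from the list, re-ingesting a changed source file should replace both chunks together.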

**Text/PDF/Markdown documents** use `RecursiveCharacterTextSplitter`:
- Default: `chunk_size=500`, `chunk_overlap=100`
- Separator priority: `\n\n` → `\n` → ` ` → (character)
- Markdown variant respects `---` section delimiters
- Content is passed through `normalize()` (tokenize + strip blanks) before storage

---

## 4. Retrieval Pipeline

### Query Processing Flow

```python
# Step 1: Normalize input
query = preprocess_query(question)
# → currently normalize() only (tokenize + strip blanks); no POS filter, lemmatization, or stopword removal

# Step 2: Expand abbreviations
processed_query = clf.expand_abbreviations(query)
# → "CE dept" → "computer engineering department"

# Step 3: Classify intent/metadata
filters = clf.predict_with_filter([processed_query])
# → {"$and": [{"type": "department"}, {"intent": "list"}, {"$or": [...]}]}

# Step 4: Vector search with optional filter
raw_results = chroma.similarity_search_with_score(query, k=15, filter=filters)
# Fallback: if filtered results empty, retry without filter

# Step 5: BM25 re-rank over vector candidates (index rebuilt per request)
bm25_retriever = BM25Retriever.from_documents(candidate_docs)

# Step 6: RRF fusion, computed per candidate document
fused_score = (bm25_weight / (rrf_k + rank_bm25)
               + vector_weight / (rrf_k + rank_vec))

# Step 7: Metadata confidence boosting
if doc.metadata[field] == predicted_val and conf > 0.90:
    result.fused_score *= boost_factor  # 1.10–1.20

# Step 8: Title word boost
for word in query_words:
    if word in doc.title:
        result.fused_score += title_boost_per_word  # 0.004

# Step 9: Threshold filter + sort + top-k
results = [r for r in results if r.fused_score >= threshold]
```
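
The weighted RRF step can be shown in isolation. The sketch below is a self-contained approximation using the `hybrid_query` defaults; the function name and the use of plain document IDs are assumptions, as is the choice that a document missing from one ranking simply receives no contribution from it.

```python
def reciprocal_rank_fusion(bm25_ranking, vector_ranking,
                           bm25_weight=0.45, vector_weight=0.55, rrf_k=20):
    """Fuse two ranked lists of document IDs; ranks are 1-based."""
    scores = {}
    for weight, ranking in ((bm25_weight, bm25_ranking), (vector_weight, vector_ranking)):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes weight / (rrf_k + rank) for the documents it contains.
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (rrf_k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


fused = reciprocal_rank_fusion(["a", "b", "c"], ["b", "a", "c"])
```

Here `b` edges out `a` because the vector ranking carries the larger weight. With weights summing to 1 and `rrf_k=20`, a raw score from this formula is at most 1/21 ≈ 0.048; any rescaling applied before the `score_threshold` comparison is not covered by this sketch.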

### Classifier Thresholds

The `Classifier` uses two separate threshold tables:

**Prediction threshold** — below this, the field is set to `None` (not used at all):

| Field | Threshold |
|---|---|
| `type` | 0.40 |
| `category` | 0.40 |
| `topic` | 0.50 |
| `intent` | 0.60 |
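
As a rough illustration of the prediction threshold, the sketch below nulls out low-confidence fields. It assumes predictions use `<field>` and `<field>_conf` keys, as suggested by the `predict()` return shape; `gate_prediction` itself is a hypothetical helper.

```python
PREDICTION_THRESHOLDS = {"type": 0.40, "category": 0.40, "topic": 0.50, "intent": 0.60}


def gate_prediction(raw: dict) -> dict:
    """Set a field to None when its confidence is below the prediction threshold."""
    gated = dict(raw)  # copy so the caller's prediction is left untouched
    for field, threshold in PREDICTION_THRESHOLDS.items():
        if raw[f"{field}_conf"] < threshold:
            gated[field] = None
    return gated
```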

**Filter threshold** (at or above this, the field is added to the ChromaDB filter: `type` as a hard `$and` anchor, `category` and `topic` as soft `$or` hints):

| Field | Threshold |
|---|---|
| `type` | 0.65 |
| `category` | 0.65 |
| `topic` | 0.70 |

### Filter Construction Logic (`_build_filter`)

```python
# Gate: if type confidence < 0.65 → return None (full scan)
# Hard anchors (always included if type passes):
#   - type == predicted_type
#   - intent == predicted_intent  (special: "count" expands to count OR detail)
# Soft hints (combined as $or):
#   - category == predicted_category  (if conf >= 0.65, else "general")
#   - topic == predicted_topic        (if conf >= 0.70, else "general")
```
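
The rules above could translate into code along these lines. This is a sketch under stated assumptions, not the actual `_build_filter`: the `<field>_conf` key names are inferred from the `predict()` return shape, and skipping the intent clause when intent is `None` is a guess at the edge-case behaviour.

```python
from typing import Optional

TYPE_GATE = 0.65
CATEGORY_GATE = 0.65
TOPIC_GATE = 0.70


def build_filter(pred: dict) -> Optional[dict]:
    """Sketch of the gate / hard-anchor / soft-hint logic described above."""
    if pred["type"] is None or pred["type_conf"] < TYPE_GATE:
        return None  # low type confidence: fall back to an unfiltered scan

    clauses = [{"type": pred["type"]}]

    intent = pred.get("intent")
    if intent == "count":
        # "count" expands to count OR detail.
        clauses.append({"$or": [{"intent": "count"}, {"intent": "detail"}]})
    elif intent is not None:
        clauses.append({"intent": intent})

    # Soft hints: fall back to the "general" bucket when confidence is too low.
    category = pred["category"] if pred["category_conf"] >= CATEGORY_GATE else "general"
    topic = pred["topic"] if pred["topic_conf"] >= TOPIC_GATE else "general"
    clauses.append({"$or": [{"category": category}, {"topic": topic}]})

    return {"$and": clauses}
```

Returning `None` rather than an empty dict keeps the "no filter" case explicit for the caller.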

### Hybrid Retrieval Config (Defaults)

| Parameter | `hybrid_query` | `search_docs` |
|---|---|---|
| `candidate_k` | 15 | 15 |
| `top_k` (final) | `settings.similarity_top_k` (8) | k (param) |
| `bm25_weight` | 0.45 | 0.70 |
| `vector_weight` | 0.55 | 0.30 |
| `rrf_k` | 20 | 20 |
| `bm25_k1` | 1.2 | 1.5 |
| `bm25_b` | 0.9 | 0.75 |
| `title_boost_per_word` | 0.004 | 0.004 |
| `score_threshold` | 0.4 | 0.4 |

> **Note:** `search_docs` is BM25-heavy (0.70) since it is used for keyword-oriented document browsing, while `hybrid_query` is vector-heavy for semantic QA.

---

## 5. Key Classes & Modules

### Services (`app/services/`)

#### `RAGService`

Main orchestrator. Singleton via `lru_cache` in `dependencies.py`.

| Method | Description |
|---|---|
| `query()` | Semantic-only QA (vector search → LLM) |
| `hybrid_query()` | Hybrid QA (BM25 + vector → RRF → LLM) |
| `search_docs()` | BM25-heavy document search, no LLM |
| `ingest_documents()` | Ingest a file path into the vector store |
| `get_filenames()` | Return all tracked file metadata records |
| `test_queries()` | Batch retrieval evaluation (MRR, precision, noise) |
| `test_classifier()` | Batch classifier accuracy evaluation |
| `delete_database()` | Drop the entire ChromaDB collection |

#### `HybridRetrievalService`

Stateless per-request service created inline by `RAGService`.

| Method | Description |
|---|---|
| `retrieve(query)` | Full hybrid retrieval pipeline; returns `List[RetrievalResult]` |
| `_vector_rank()` | Chroma similarity search + classifier filter |
| `_bm25_rank()` | BM25 over candidate pool |
| `_reciprocal_rank_fusion()` | Merge both ranked lists via RRF |
| `_apply_title_boost()` | Word-level title match score bonus |

**`RetrievalResult` dataclass:**

```python
@dataclass
class RetrievalResult:
    document: Document
    fused_score: float
    bm25_rank: Optional[int]
    vector_rank: Optional[int]
    title_boost: float
```
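
To illustrate how the two boosting steps affect `fused_score` and `title_boost`, here is a minimal sketch. `ScoredChunk` is a simplified stand-in for `RetrievalResult`, and `apply_boosts`, the single `boost_factor` of 1.10, and whole-word title matching are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScoredChunk:
    """Simplified stand-in for RetrievalResult (title and metadata inlined)."""
    title: str
    metadata: dict
    fused_score: float
    title_boost: float = 0.0


def apply_boosts(chunk, query, predicted, confidences,
                 boost_factor=1.10, min_conf=0.90, title_boost_per_word=0.004):
    # Metadata boost: multiply once per field that matches a confident prediction.
    for field, value in predicted.items():
        if chunk.metadata.get(field) == value and confidences.get(field, 0.0) > min_conf:
            chunk.fused_score *= boost_factor
    # Title boost: fixed bonus for each distinct query word found in the title.
    title_words = set(chunk.title.lower().split())
    matches = sum(1 for word in set(query.lower().split()) if word in title_words)
    chunk.title_boost = matches * title_boost_per_word
    chunk.fused_score += chunk.title_boost
    return chunk
```

The multiplicative boost scales with the existing score, while the title bonus is additive, so the title bonus matters most for chunks with low fused scores.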

#### `Classifier`

Loaded at startup from a pickled pipeline (`chatbot_classifier.pkl`).

| Method | Description |
|---|---|
| `predict(queries)` | Returns list of `{type, category, topic, intent, *_conf}` dicts |
| `predict_with_filter(queries)` | Returns a ChromaDB-compatible filter dict or `None` |
| `expand_abbreviations(text)` | Regex-based abbreviation expansion |
| `get_features(queries)` | Build `[SentenceTransformer embedding | TF-IDF]` feature matrix |
| `train_models(df)` | Train 4 LogisticRegression classifiers (offline use) |

#### `IngestionService`

| Method | Description |
|---|---|
| `ingest(file_path)` | Load + chunk a file; returns `List[Document]` |
| `handle_json_docs()` | Intent-aware chunking for structured JSON data |
| `handle_text_docs()` | Recursive character splitting for unstructured text |
| `get_records()` | Delegate to `FileService.get_records()` |
| `delete_record(filename)` | Remove a file's metadata record |
| `path_record(path, metadata)` | Patch ingestion stats after indexing |

#### `FileService`

| Method | Description |
|---|---|
| `read_file(path)` | Load file content; dispatches by extension |
| `write_file(path, content, metadata)` | Persist file to `data/documents/` |
| `patch_metadata(path, metadata)` | Merge new fields into existing record |
| `get_records()` | Return all ingestion records dict |
| `delete_record(filename)` | Remove a record from `<collection>.json` |

#### `VectorStore`

Thin wrapper around `langchain_chroma.Chroma`.

| Method | Description |
|---|---|
| `get()` | Retrieve all documents |
| `get_by_id(ids)` | Retrieve specific documents by ID |
| `add_documents(docs)` | Embed + insert, skipping empty chunks |
| `update_document(id, doc)` | Delete then re-insert with same ID |
| `delete(ids)` | Remove documents by ID list |
| `similarity_search_with_score()` | Wrapped Chroma search |

### Utilities (`app/utils/`)

#### `preprocessing.py`

| Function | Description |
|---|---|
| `preprocess(text)` | spaCy POS filter + lemmatize + stopword removal → joined string |
| `normalize(text)` | Tokenize + strip blanks (lightweight, no POS) |
| `preprocess_query(query)` | Applies `normalize()` to user queries |
| `preprocess_documents(docs)` | Applies `preprocess()` to a document list in-place |
| `preprocess_filename(path)` | Sanitize filename (remove special chars, lowercase) |

#### `document_helpers.py`

| Function | Description |
|---|---|
| `get_references_v2(docs, threshold)` | Convert `RetrievalResult` list → references dict + context string |
| `get_references(docs, threshold)` | Same for raw `(Document, distance)` tuples (used by `query()`) |
| `build_metadata(path)` | Parse YAML frontmatter from `.md`/`.txt` files |
| `create_documents(chunks, ...)` | Attach standard metadata (UUID, timestamps, indices) to chunks |
| `create_documents_from_text(text)` | Full pipeline: frontmatter parse → split → metadata attach |
| `clean_metadata(metadata)` | Serialize datetime, coerce non-allowed types to string |

#### `model_factory.py`

| Function | Description |
|---|---|
| `get_embedding_model()` | Returns `GoogleGenerativeAIEmbeddings` |
| `get_gemini_model()` | Returns `ChatGoogleGenerativeAI` |
| `get_local_model()` | Returns `ChatLlamaCpp` (GGUF, CPU inference) |
| `get_llm_model(provider)` | Dispatches to Gemini or Local with fallback logic |

### API Routes (`app/api/routes/`)

#### `rag.py` — prefix `/api/v1/rag`

| Method | Endpoint | Description |
|---|---|---|
| GET | `/` | Health check |
| POST | `/` | Semantic query |
| POST | `/hybrid_query` | Hybrid RAG query (primary endpoint) |
| POST | `/similarity_search` | Hybrid retrieval, no LLM response |
| POST | `/search` | BM25-heavy document search |
| POST | `/test` | Batch retrieval evaluation |
| POST | `/test_classifier` | Classifier accuracy evaluation |
| GET | `/test_classifier_dataset` | Run built-in test dataset, cache result |

#### `vector_store.py` — prefix `/api/v1/vector`

| Method | Endpoint | Description |
|---|---|---|
| GET | `/` | List all documents (paginated, filterable) |
| GET | `/filenames` | List ingested file records |
| GET | `/{id}` | Get single document by ChromaDB ID |
| POST | `/` | Upload + ingest file |
| PUT | `/{id}` | Update document content/metadata |
| DELETE | `/ids` | Bulk delete by ID list |
| DELETE | `/{id}` | Delete single document |
| DELETE | `/` | Filter-based delete (filename/source/contains) |

### Configuration (`app/core/config.py`)

All settings are read from `.env` via Pydantic `BaseSettings`:

```python
class Settings(BaseSettings):
    # Paths
    collection_name: str = "classifier_test_1"
    persist_directory: str = "./data/vector_stores/classifier_test_1"

    # Chunking
    chunk_size: int = 500
    chunk_overlap: int = 100

    # Retrieval
    similarity_top_k: int = 8
    similarity_threshold: float = 0.4

    # LLM Provider
    llm_provider: Literal["gemini", "local"] = "local"
    enable_fallback: bool = True

    # Models
    embedding_model_name: str = "models/gemini-embedding-001"
    gemini_model_name: str = "gemini-2.5-flash-lite"
    local_model_name: str = "EXAONE-3.5-2.4B-Instruct-Q4_K_M.gguf"

    # Generation
    max_output_tokens: int = 2048
    local_max_tokens: int = 512

    # Auth
    google_api_key: str  # required — must be in .env
```

---

## 6. Evaluation & Metrics

### Retrieval Evaluation (`test_queries` / `POST /api/v1/rag/test`)

Tests each (question, expected_document, expected_chunk_index) triple against `hybrid_query`:

| Metric | Formula | Interpretation |
|---|---|---|
| **Hit Rate** | `hits / total` | % of questions where the exact chunk was retrieved |
| **Top-1 Hit Rate** | `rank==1 hits / total` | % of questions where exact chunk was top result |
| **MRR** | `mean(1/rank)` | Mean Reciprocal Rank; higher = correct result ranked earlier |
| **Doc Precision** | `correct_source_chunks / all_chunks` | How many retrieved chunks came from the right document |
| **Doc Recall** | `1 if any correct_source_chunk else 0` | Did we retrieve at least one chunk from the right document? |
| **Doc Noise** | `wrong_source_chunks / all_chunks` | Proportion of off-topic chunks in the result set |
| **Error Rate** | `1 - hit_rate` | Miss rate for exact chunk retrieval |
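
The formulas in the table can be made concrete with a short sketch. The input shape (`rank`, `retrieved_sources`, `expected_source`) and the `evaluate` function are hypothetical; the sketch also assumes that misses contribute a reciprocal rank of 0 and that document-level metrics are averaged per query.

```python
def evaluate(results):
    """Aggregate retrieval metrics over a list of per-question results.

    Each result has 'rank' (1-based rank of the expected chunk, or None if
    missed), 'retrieved_sources' (source file of each retrieved chunk), and
    'expected_source' (file the expected chunk belongs to).
    """
    total = len(results)
    hits = [r for r in results if r["rank"] is not None]
    hit_rate = len(hits) / total
    precisions, recalls, noises = [], [], []
    for r in results:
        sources = r["retrieved_sources"]
        correct = sum(1 for s in sources if s == r["expected_source"])
        precisions.append(correct / len(sources) if sources else 0.0)
        noises.append((len(sources) - correct) / len(sources) if sources else 0.0)
        recalls.append(1.0 if correct else 0.0)
    return {
        "hit_rate": hit_rate,
        "top1_hit_rate": sum(1 for r in hits if r["rank"] == 1) / total,
        "mrr": sum(1 / r["rank"] for r in hits) / total,
        "error_rate": 1 - hit_rate,
        "doc_precision": sum(precisions) / total,
        "doc_recall": sum(recalls) / total,
        "doc_noise": sum(noises) / total,
    }
```

Note that a question can miss the exact chunk (`rank` is `None`) and still score on document recall if another chunk from the right file was retrieved.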

**Test Input Schema:**

```python
class TestRequestSchema(BaseModel):
    tests: List[Test]   # question + document + chunk_index
    k: int = 5
    threshold: float = 0.4
```

### Classifier Evaluation (`test_classifier` / `POST /api/v1/rag/test_classifier`)

Evaluates predictions for all 4 classification fields (`type`, `category`, `topic`, `intent`):

| Metric | Notes |
|---|---|
| **Accuracy** | `sklearn.accuracy_score` |
| **Precision (macro)** | `zero_division=0` |
| **Recall (macro)** | `zero_division=0` |
| **F1 Macro** | Unweighted average across classes |
| **F1 Weighted** | Class-frequency weighted |
| **Classification Report** | Full per-class breakdown (`output_dict=True`) |

A bundled test dataset is stored in `app/utils/tests.py` as `classifier_test_dataset` and can be executed via `GET /api/v1/rag/test_classifier_dataset`. Results are **memoized** on the `RAGService.evaluation` dict for the lifetime of the server process.

---

## 7. Known Limitations

### Technical Debt

- **`preprocess_query` is incomplete.** The function body contains an LLM-powered query-rewriting block that is commented out. It currently just calls `normalize()` (tokenize only), so no stopword removal or lemmatization is applied to user queries (only to stored documents).
- **`search_docs` does not honour `filename` as a metadata filter in Chroma.** The filter is applied in Python post-retrieval, which is inefficient for large collections.
- **Count intent is synthetic.** The `"Total <topic>: N"` chunk is generated during ingestion rather than taken from the source document. If the source data changes, stale count chunks can remain indexed.
- **Leftover debug output.** `VectorStore.get_dict()` still contains a `print(type(rows))` statement.
- **Docstring typo.** The `FileService.__init__` docstring contains a stray backtick.

### Planned but Unimplemented

- **Query rewriting via local LLM** — skeleton is commented out in `preprocess_query()`.
- **Semantic caching** — no query result memoization at the API layer.
- **Re-ranker** — no cross-encoder re-ranking step; relies only on RRF + boosting.
- **`topic` field is not included in the ChromaDB hard filter** — only `type` + `intent` are hard-anchored; `category` and `topic` are soft `$or` hints.

### Performance Bottlenecks

- **Local LLM (LlamaCpp)** is CPU-only with `n_ctx=8096` and `n_threads=4`. Response latency is high (~10–30s) on low-RAM systems.
- **Classifier uses `SentenceTransformer` + `TF-IDF` features** — inference runs on every request with no caching of query embeddings.
- **BM25 corpus is rebuilt from scratch per request** — `BM25Retriever.from_documents()` is called inside `_bm25_rank()` each time.
- **`classifier_test_dataset` in `app/utils/tests.py`** makes that module very large (1.8MB), and it is loaded at import time.
- **The memoized evaluation** in `RAGService.evaluation` is not synchronized: concurrent requests can race to populate it, and with multiple worker processes each worker keeps its own copy.
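
One way the race on the memoized evaluation could be avoided within a single process is a lock-guarded cache. The sketch below is illustrative only: `EvaluationCache` is a hypothetical class, not something in the codebase, and it does not share results across worker processes.

```python
import threading


class EvaluationCache:
    """Lock-guarded memo for expensive evaluation results (illustrative only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._results = {}

    def get_or_compute(self, key, compute):
        # Holding the lock during compute() serializes callers, so the
        # evaluation runs at most once per key within this process.
        with self._lock:
            if key not in self._results:
                self._results[key] = compute()
            return self._results[key]
```

Holding the lock for the whole computation trades concurrency for simplicity, which seems acceptable for an evaluation endpoint that is called rarely.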

---

## 8. File Structure

```
VGEC-RAG-Chatbot/
│
├── app/                            # Application package
│   ├── main.py                     # FastAPI app, router mounting, CORS middleware
│   ├── core/
│   │   ├── config.py               # Pydantic Settings (all tuneable params)
│   │   └── paths.py                # Path constants helper
│   │
│   ├── api/
│   │   ├── dependencies.py         # lru_cache singleton for RAGService
│   │   ├── routes/
│   │   │   ├── rag.py              # /rag endpoints (query, test, classifier)
│   │   │   ├── vector_store.py     # /vector endpoints (CRUD for ChromaDB)
│   │   │   └── settings.py         # /settings endpoints
│   │   └── schemas/
│   │       ├── requests.py         # RAGRequest, PaginationParams, etc.
│   │       └── tests.py            # TestRequestSchema, TestClassifierReqSchema
│   │
│   ├── services/
│   │   ├── rag_service.py          # RAGService (main orchestrator)
│   │   ├── hybrid_retrieval.py     # HybridRetrievalService + RRF logic
│   │   ├── classifier_service.py   # Classifier class + singleton clf
│   │   ├── ingestion_service.py    # IngestionService (chunking pipeline)
│   │   ├── file_service.py         # FileService (file I/O + metadata JSON)
│   │   ├── vector_store.py         # VectorStore (thin ChromaDB wrapper)
│   │   ├── text_splitter.py        # TextSplitter (RecursiveCharacter + variants)
│   │   └── document_loader.py      # (legacy loader, not in primary path)
│   │
│   ├── utils/
│   │   ├── preprocessing.py        # preprocess(), normalize(), preprocess_query()
│   │   ├── document_helpers.py     # get_references_v2(), build_metadata(), create_documents()
│   │   ├── model_factory.py        # get_llm_model(), get_embedding_model()
│   │   ├── constants.py            # stopwords list, short_words_mappings
│   │   ├── embeddings.py           # (thin embedding util)
│   │   ├── llm_models.py           # (thin LLM util)
│   │   └── tests.py                # classifier_test_dataset (large, 1.8MB)
│   │
│   └── prompts/
│       └── __init__.py             # SYSTEM_PROMPT, wrap_exaone()
│
├── ml_models/
│   ├── classifier/
│   │   └── chatbot_classifier.pkl  # Pickled pipeline (models, tfidf, label encoders, etc.)
│   ├── embeddings/                 # (Local embedding model weights, if any)
│   └── llm/
│       └── EXAONE-3.5-2.4B-*.gguf # Local LLM weights
│
├── data/
│   ├── department_data/            # Source JSON files per department
│   │   ├── computer_eng.json
│   │   ├── civil.json
│   │   └── ...
│   ├── documents/                  # Persistent copies of ingested files
│   ├── vector_stores/
│   │   └── classifier_test_1/      # ChromaDB persist directory
│   ├── classifier_test_1.json      # Ingestion metadata registry (FileService records)
│   └── other_data/                 # Misc data files
│
├── temp/                           # Staging area for uploaded files (auto-cleared)
├── scripts/                        # Offline scripts (training, testing)
├── tests/                          # Test files
│
├── requirements.txt                # Pinned production dependencies
├── .env                            # Runtime secrets (google_api_key, etc.)
├── .env.example                    # Template for .env
└── CODEBASE_DOCUMENTATION.md       # This file
```

---

*End of documentation.*