VGEC RAG Chatbot: Codebase Documentation
Generated: 2026-03-25
Version: 1.0.0
Scope: Full system (ingestion, retrieval, classification, API, evaluation)
Table of Contents
- Project Overview
- System Architecture
- Schema & Data Model
- Retrieval Pipeline
- Key Classes & Modules
- Evaluation & Metrics
- Known Limitations
- File Structure
1. Project Overview
Purpose
VGEC RAG Chatbot is a Retrieval-Augmented Generation (RAG) chatbot for Vishwakarma Government Engineering College (VGEC), Chandkheda, Gujarat. It allows students, faculty, and visitors to query structured information about the institution (departments, faculty, syllabus, labs, intake capacity, and more) through natural language.
Domain
- Institution: VGEC (Government Engineering College, Gujarat)
- Data Coverage: Department-level information for multiple disciplines (Computer Engineering, Civil, Electrical, IT, ECE, etc.)
- Topics: Faculty lists, lab facilities, syllabus details, HOD info, research activities, intake capacity, achievements
Tech Stack
| Layer | Technology |
|---|---|
| API Framework | FastAPI |
| Vector Database | ChromaDB (persistent, local) |
| Embeddings | Google gemini-embedding-001 (via langchain-google-genai) |
| LLM (Cloud) | Google Gemini gemini-2.5-flash-lite |
| LLM (Local) | EXAONE-3.5-2.4B-Instruct-Q4_K_M.gguf via llama-cpp-python |
| NLP / Preprocessing | spaCy (en_core_web_sm), NLTK (PorterStemmer) |
| Classifier | Scikit-learn LogisticRegression + SentenceTransformer (MongoDB/mdbr-leaf-mt) |
| BM25 | langchain-community BM25Retriever |
| Chunking | LangChain RecursiveCharacterTextSplitter |
| Config | Pydantic BaseSettings (.env-backed) |
Key Features Implemented
- ✅ Structured JSON ingestion with intent-aware chunking
- ✅ Hybrid retrieval: BM25 + vector search fused via Reciprocal Rank Fusion (RRF)
- ✅ Intent/metadata classification with confidence-gated ChromaDB filters
- ✅ Abbreviation expansion (CE → Computer Engineering, etc.)
- ✅ Multi-turn conversation history support
- ✅ Dual LLM backend with automatic fallback (Gemini → Local)
- ✅ Full CRUD REST API for vector store management
- ✅ Offline evaluation endpoint (MRR, hit rate, noise rate)
- ✅ Classifier accuracy evaluation endpoint
2. System Architecture
Component Diagram
```
           ┌───────────────────────────┐
           │        FastAPI App        │
           │   /api/v1/rag   /vector   │
           └─────────────┬─────────────┘
                         │ DI (lru_cache)
           ┌─────────────▼─────────────┐
           │        RAGService         │
           │    (core orchestrator)    │
           └────┬───────────────┬──────┘
                │               │
   ┌────────────▼────┐ ┌────────▼───────────────┐
   │ IngestionService│ │ HybridRetrievalService │
   │  (write path)   │ │      (read path)       │
   └───────┬─────────┘ └───┬────────────┬───────┘
           │               │            │
   ┌───────▼──────┐ ┌──────▼───────┐ ┌──▼─────────────┐
   │ FileService  │ │ ClassifierSvc│ │  VectorStore   │
   │ (file + meta)│ │ (clf predict)│ │   (ChromaDB)   │
   └──────────────┘ └──────────────┘ └────────────────┘
```
Data Flow
Ingestion Path
```
File Upload (PDF/MD/TXT/JSON)
      │
      ▼
FileService.read_file()          → type-aware loading (PyMuPDF for PDF)
      │   returns: Document + metadata
      ▼
FileService.write_file()         → persist copy to data/documents/
      │
      ▼
IngestionService.handle_*_docs() → route by file extension
      │
      ├─ JSON → handle_json_docs() → intent-aware chunks (list / detail / count)
      └─ text → handle_text_docs() → RecursiveCharacterTextSplitter + normalize()
      │
      ▼
VectorStore.add_documents()      → embed + upsert into ChromaDB
      │
      ▼
FileService.patch_metadata()     → update ingestion record JSON (chunk count, timing, size)
```
Query Path
```
User Question
      │
      ▼
preprocess_query() → normalize() (tokenize only; see Known Limitations)
      │
      ▼
HybridRetrievalService.retrieve()
      │
      ├─ clf.expand_abbreviations()  → CE → Computer Engineering
      ├─ clf.predict_with_filter()   → LogReg predict → Chroma $and/$or filter
      ├─ _vector_rank()              → ChromaDB similarity_search_with_score (k=15)
      ├─ _bm25_rank()                → BM25 over the vector candidate pool
      ├─ _reciprocal_rank_fusion()   → weighted RRF merge
      ├─ metadata score boosting     → multiply fused scores for confident matches
      └─ _apply_title_boost()        → per-query-word title match bonus
      │
      ▼
get_references_v2() → filter by threshold, build context string
      │
      ▼
LLM.invoke(prompt) → Gemini or local LlamaCpp
      │
      ▼
Return: { answer, references, context, threshold_used, k_used }
```
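For illustration, a minimal client call against the primary endpoint. The request field names (`question`) are an assumption inferred from the response shape above, not confirmed from the `RAGRequest` schema:

```python
import requests

# Hypothetical request body; field names are assumed, not taken from RAGRequest.
resp = requests.post(
    "http://localhost:8000/api/v1/rag/hybrid_query",
    json={"question": "Who is the HOD of the CE department?"},
    timeout=120,  # local LlamaCpp responses can take tens of seconds
)
body = resp.json()
print(body["answer"])      # LLM-generated answer
print(body["references"])  # chunks that passed the score threshold
```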
External Dependencies
| Dependency | Role | Provider |
|---|---|---|
| ChromaDB | Persistent vector store | Local disk |
| Google Gemini API | Embeddings + LLM generation | Google Cloud |
| LlamaCpp (GGUF model) | Local LLM fallback | Local CPU |
| Sentence Transformers | Classifier feature extraction | HuggingFace Hub |
| spaCy en_core_web_sm | POS tagging / lemmatization | Local |
3. Schema & Data Model
Source JSON Format
Source data files (e.g. computer_eng.json) follow this schema:
```json
{
  "id": "computer-engineering-department",
  "name": "Computer Engineering Department",
  "source": "https://www.vgecg.ac.in/department.php?dept=3",
  "category": "computer_eng",
  "type": "department",
  "created_date": "2026-02-19",
  "content": {
    "<topic_key>": {
      "list": ["item 1", "item 2", "..."],
      "details": "Paragraph describing the topic."
    }
  }
}
```
Top-level fields:
| Field | Type | Description |
|---|---|---|
| `id` | string | Unique document identifier |
| `name` | string | Human-readable institution/department name |
| `source` | string | Authoritative URL |
| `category` | string | Department slug (e.g. `computer_eng`) |
| `type` | string | Document type (e.g. `department`) |
| `created_date` | string (ISO) | Data creation date |
| `content` | object | Topic map; each key = a topic |
Chunk Metadata Schema (stored in ChromaDB)
Every vector chunk stored in Chroma carries the following metadata:
| Field | Type | Source |
|---|---|---|
| `id` | string (UUID) | Auto-generated |
| `title` | string | Document name / topic key |
| `source` | string | Source URL |
| `source_file` | string | Filename (e.g. `computer_eng.json`) |
| `type` | string | Taxonomy level 1 (e.g. `department`) |
| `category` | string | Taxonomy level 2 (e.g. `computer_eng`) |
| `topic` | string | Taxonomy level 3 (e.g. `faculty`) |
| `intent` | string | Chunk intent: `list`, `detail`, or `count` |
| `chunk_index` | int | Sequential index within file |
| `created_date` | string (ISO) | Ingestion timestamp |
| `updated_at` | string (ISO) | Last modification timestamp |
| `ext` | string | Source file extension (`json`, `pdf`, `md`, `txt`) |
Hierarchical Taxonomy
The classifier predicts and ChromaDB filters operate on a 3-level hierarchy:
```
type
└── category
    └── topic
        └── intent (list | detail | count)
```
Example mapping (Computer Engineering):
```
type: "department"
└── category: "computer_eng"
    ├── topic: "faculty"      → intent: list | detail
    ├── topic: "lab"          → intent: list | detail
    ├── topic: "syllabus"     → intent: list | detail
    ├── topic: "hod"          → intent: list | detail
    ├── topic: "intake"       → intent: list | detail
    ├── topic: "research"     → intent: list | detail
    └── topic: "achievements"
```
Document Chunking Strategy
JSON documents use a hand-crafted, intent-aware strategy in IngestionService.handle_json_docs():
| Intent | Chunk Content | Metadata |
|---|---|---|
| `list` | Numbered list: `1. item\n2. item\n...` | `intent=list` |
| `count` | `"Total <topic>: N"` (auto-generated) | `intent=count` |
| `detail` | Raw paragraph text | `intent=detail` |
Text/PDF/Markdown documents use RecursiveCharacterTextSplitter (a minimal sketch follows this list):
- Default: `chunk_size=500`, `chunk_overlap=100`
- Separator priority: `\n\n` → `\n` → `" "` (character fallback)
- Markdown variant respects `---` section delimiters
- Content is passed through `normalize()` (tokenize + strip blanks) before storage
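To make the JSON strategy concrete, here is a minimal sketch of intent-aware chunking under the schema from section 3. `chunk_department_json` is an illustrative name, not the actual `handle_json_docs()` body:

```python
from langchain_core.documents import Document

def chunk_department_json(data: dict) -> list[Document]:
    """Illustrative intent-aware chunking for the department JSON schema."""
    docs = []
    base = {"title": data["name"], "source": data["source"],
            "type": data["type"], "category": data["category"]}
    for topic, body in data["content"].items():
        meta = {**base, "topic": topic}
        items = body.get("list") or []
        if items:
            # intent=list: one numbered-list chunk per topic
            numbered = "\n".join(f"{i}. {item}" for i, item in enumerate(items, 1))
            docs.append(Document(page_content=numbered,
                                 metadata={**meta, "intent": "list"}))
            # intent=count: synthetic "Total <topic>: N" chunk
            docs.append(Document(page_content=f"Total {topic}: {len(items)}",
                                 metadata={**meta, "intent": "count"}))
        if body.get("details"):
            # intent=detail: raw paragraph chunk
            docs.append(Document(page_content=body["details"],
                                 metadata={**meta, "intent": "detail"}))
    return docs
```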
4. Retrieval Pipeline
Query Processing Flow
```python
# Step 1: Normalize input
question = preprocess_query(question)
# → currently just normalize() (tokenize); the planned spaCy POS filter
#   (NOUN, PROPN, VERB, NUM, ADJ) + lemmatize + stopword step is commented out

# Step 2: Expand abbreviations
processed_query = clf.expand_abbreviations(query)
# → "CE dept" → "computer engineering department"

# Step 3: Classify intent/metadata
filters = clf.predict_with_filter([processed_query])
# → {"$and": [{"type": "department"}, {"intent": "list"}, {"$or": [...]}]}

# Step 4: Vector search with optional filter
raw_results = chroma.similarity_search_with_score(query, k=15, filter=filters)
# Fallback: if filtered results are empty, retry without the filter

# Step 5: BM25 re-rank over vector candidates
bm25_results = BM25Retriever.from_documents(candidate_docs)

# Step 6: RRF fusion
# fused_score(d) = bm25_weight   * 1/(rrf_k + rank_bm25)
#                + vector_weight * 1/(rrf_k + rank_vec)

# Step 7: Metadata confidence boosting
if doc.metadata[field] == predicted_val and conf > 0.90:
    result.fused_score *= boost_factor  # 1.10–1.20

# Step 8: Title word boost
for word in query_words:
    if word in doc.title:
        result.fused_score += title_boost_per_word  # 0.004

# Step 9: Threshold filter + sort + top-k
results = [r for r in results if r.fused_score >= threshold]
```
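The Step 6 formula translates directly into code. Below is a minimal standalone sketch of the weighted RRF merge (names are illustrative; the defaults mirror the `hybrid_query` column of the config table below):

```python
def reciprocal_rank_fusion(bm25_ranked, vector_ranked,
                           bm25_weight=0.45, vector_weight=0.55, rrf_k=20):
    """Weighted RRF over two ranked lists of document IDs (1-based ranks)."""
    fused = {}
    for rank, doc_id in enumerate(bm25_ranked, start=1):
        fused[doc_id] = fused.get(doc_id, 0.0) + bm25_weight / (rrf_k + rank)
    for rank, doc_id in enumerate(vector_ranked, start=1):
        fused[doc_id] = fused.get(doc_id, 0.0) + vector_weight / (rrf_k + rank)
    # Highest fused score first
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```

Documents appearing in both lists accumulate contributions from each, which is what lets RRF reward agreement between the lexical and semantic rankers.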
Classifier Thresholds
The Classifier uses two separate threshold tables:
Prediction threshold: below this, the field is set to `None` (not used at all):
| Field | Threshold |
|---|---|
| type | 0.40 |
| category | 0.40 |
| topic | 0.50 |
| intent | 0.60 |
Filter threshold: above this, the field becomes a hard ChromaDB `$and` filter:
| Field | Threshold |
|---|---|
| type | 0.65 |
| category | 0.65 |
| topic | 0.70 |
Filter Construction Logic (_build_filter)
```python
# Gate: if type confidence < 0.65 → return None (full scan)
# Hard anchors (always included if type passes):
#   - type   == predicted_type
#   - intent == predicted_intent (special: "count" expands to count OR detail)
# Soft hints (combined as $or):
#   - category == predicted_category (if conf >= 0.65, else "general")
#   - topic    == predicted_topic    (if conf >= 0.70, else "general")
```
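A sketch of this gating logic, assuming the prediction dict shape returned by `predict()` (`{field, field_conf}`). It illustrates the rules above and is not the actual `_build_filter` body:

```python
def build_filter(pred: dict):
    """Confidence-gated ChromaDB filter construction (illustrative)."""
    # Gate: low type confidence -> no filter, full scan
    if pred["type_conf"] < 0.65:
        return None
    clauses = [{"type": pred["type"]}]
    # "count" intent also matches detail chunks
    if pred["intent"] == "count":
        clauses.append({"intent": {"$in": ["count", "detail"]}})
    else:
        clauses.append({"intent": pred["intent"]})
    # Soft hints: fall back to "general" below their filter thresholds
    category = pred["category"] if pred["category_conf"] >= 0.65 else "general"
    topic = pred["topic"] if pred["topic_conf"] >= 0.70 else "general"
    clauses.append({"$or": [{"category": category}, {"topic": topic}]})
    return {"$and": clauses}
```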
Hybrid Retrieval Config (Defaults)
| Parameter | hybrid_query | search_docs |
|---|---|---|
| candidate_k | 15 | 15 |
| top_k (final) | `settings.similarity_top_k` (8) | `k` (param) |
| bm25_weight | 0.45 | 0.70 |
| vector_weight | 0.55 | 0.30 |
| rrf_k | 20 | 20 |
| bm25_k1 | 1.2 | 1.5 |
| bm25_b | 0.9 | 0.75 |
| title_boost_per_word | 0.004 | 0.004 |
| score_threshold | 0.4 | 0.4 |
Note: `search_docs` is BM25-heavy (0.70) since it is used for keyword-oriented document browsing, while `hybrid_query` is vector-heavy for semantic QA.
5. Key Classes & Modules
Services (app/services/)
RAGService
Main orchestrator. Singleton via lru_cache in dependencies.py.
| Method | Description |
|---|---|
| `query()` | Semantic-only QA (vector search → LLM) |
| `hybrid_query()` | Hybrid QA (BM25 + vector → RRF → LLM) |
| `search_docs()` | BM25-heavy document search, no LLM |
| `ingest_documents()` | Ingest a file path into the vector store |
| `get_filenames()` | Return all tracked file metadata records |
| `test_queries()` | Batch retrieval evaluation (MRR, precision, noise) |
| `test_classifier()` | Batch classifier accuracy evaluation |
| `delete_database()` | Drop the entire ChromaDB collection |
HybridRetrievalService
Stateless per-request service created inline by RAGService.
| Method | Description |
|---|---|
| `retrieve(query)` | Full hybrid retrieval pipeline; returns `List[RetrievalResult]` |
| `_vector_rank()` | Chroma similarity search + classifier filter |
| `_bm25_rank()` | BM25 over candidate pool |
| `_reciprocal_rank_fusion()` | Merge both ranked lists via RRF |
| `_apply_title_boost()` | Word-level title match score bonus |
RetrievalResult dataclass:
```python
@dataclass
class RetrievalResult:
    document: Document
    fused_score: float
    bm25_rank: Optional[int]
    vector_rank: Optional[int]
    title_boost: float
```
Classifier
Loaded at startup from a pickled pipeline (chatbot_classifier.pkl).
| Method | Description |
|---|---|
| `predict(queries)` | Returns list of `{type, category, topic, intent, *_conf}` dicts |
| `predict_with_filter(queries)` | Returns a ChromaDB-compatible filter dict or `None` |
| `expand_abbreviations(text)` | Regex-based abbreviation expansion |
| `get_features(queries)` | Build a concatenated SentenceTransformer-embedding + TF-IDF feature matrix |
| `train_models(df)` | Train 4 LogisticRegression classifiers (offline use) |
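As a sketch of the feature construction, assuming the dense encoder is the `MongoDB/mdbr-leaf-mt` model from the tech stack table and `tfidf` is the fitted vectorizer stored in the pickled pipeline:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("MongoDB/mdbr-leaf-mt")

def get_features(queries, tfidf):
    """Dense embedding ++ TF-IDF concatenation (tfidf: fitted TfidfVectorizer)."""
    dense = encoder.encode(queries)              # (n, d) dense embeddings
    sparse = tfidf.transform(queries).toarray()  # (n, v) TF-IDF, densified
    return np.hstack([dense, sparse])            # combined feature matrix
```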
IngestionService
| Method | Description |
|---|---|
| `ingest(file_path)` | Load + chunk a file; returns `List[Document]` |
| `handle_json_docs()` | Intent-aware chunking for structured JSON data |
| `handle_text_docs()` | Recursive character splitting for unstructured text |
| `get_records()` | Delegate to `FileService.get_records()` |
| `delete_record(filename)` | Remove a file's metadata record |
| `path_record(path, metadata)` | Patch ingestion stats after indexing |
FileService
| Method | Description |
|---|---|
| `read_file(path)` | Load file content; dispatches by extension |
| `write_file(path, content, metadata)` | Persist file to `data/documents/` |
| `patch_metadata(path, metadata)` | Merge new fields into the existing record |
| `get_records()` | Return all ingestion records as a dict |
| `delete_record(filename)` | Remove a record from `<collection>.json` |
VectorStore
Thin wrapper around langchain_chroma.Chroma.
| Method | Description |
|---|---|
| `get()` | Retrieve all documents |
| `get_by_id(ids)` | Retrieve specific documents by ID |
| `add_documents(docs)` | Embed + insert, skipping empty chunks |
| `update_document(id, doc)` | Delete then re-insert with the same ID |
| `delete(ids)` | Remove documents by ID list |
| `similarity_search_with_score()` | Wrapped Chroma search |
Utilities (app/utils/)
preprocessing.py
| Function | Description |
|---|---|
| `preprocess(text)` | spaCy POS filter + lemmatize + stopword removal → joined string |
| `normalize(text)` | Tokenize + strip blanks (lightweight, no POS) |
| `preprocess_query(query)` | Applies `normalize()` to user queries |
| `preprocess_documents(docs)` | Applies `preprocess()` to a document list in-place |
| `preprocess_filename(path)` | Sanitize filename (remove special chars, lowercase) |
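A minimal sketch of `preprocess()` as described (POS filter over NOUN, PROPN, VERB, NUM, ADJ; lemmatize; drop stopwords). The real implementation's token filtering may differ in detail:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
KEEP_POS = {"NOUN", "PROPN", "VERB", "NUM", "ADJ"}

def preprocess(text: str) -> str:
    """Keep content-bearing POS tags, lemmatize, drop stopwords."""
    doc = nlp(text)
    return " ".join(tok.lemma_.lower() for tok in doc
                    if tok.pos_ in KEEP_POS and not tok.is_stop)
```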
document_helpers.py
| Function | Description |
|---|---|
| `get_references_v2(docs, threshold)` | Convert `RetrievalResult` list → references dict + context string |
| `get_references(docs, threshold)` | Same for raw `(Document, distance)` tuples (used by `query()`) |
| `build_metadata(path)` | Parse YAML frontmatter from `.md`/`.txt` files |
| `create_documents(chunks, ...)` | Attach standard metadata (UUID, timestamps, indices) to chunks |
| `create_documents_from_text(text)` | Full pipeline: frontmatter parse → split → metadata attach |
| `clean_metadata(metadata)` | Serialize datetime, coerce non-allowed types to string |
model_factory.py
| Function | Description |
|---|---|
| `get_embedding_model()` | Returns `GoogleGenerativeAIEmbeddings` |
| `get_gemini_model()` | Returns `ChatGoogleGenerativeAI` |
| `get_local_model()` | Returns `ChatLlamaCpp` (GGUF, CPU inference) |
| `get_llm_model(provider)` | Dispatches to Gemini or Local with fallback logic |
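A sketch of the fallback dispatch; the real `get_llm_model()` likely gates on `settings.enable_fallback` and catches narrower exceptions than shown here:

```python
def get_llm_model(provider: str = "gemini"):
    """Dispatch with fallback: try the requested backend, swap on failure."""
    try:
        if provider == "gemini":
            return get_gemini_model()  # ChatGoogleGenerativeAI
        return get_local_model()       # ChatLlamaCpp (GGUF, CPU)
    except Exception:
        # Fallback: use the alternate backend if initialization fails
        return get_local_model() if provider == "gemini" else get_gemini_model()
```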
API Routes (app/api/routes/)
rag.py (prefix `/api/v1/rag`)
| Method | Endpoint | Description |
|---|---|---|
| GET | `/` | Health check |
| POST | `/` | Semantic query |
| POST | `/hybrid_query` | Hybrid RAG query (primary endpoint) |
| POST | `/similarity_search` | Hybrid retrieval, no LLM response |
| POST | `/search` | BM25-heavy document search |
| POST | `/test` | Batch retrieval evaluation |
| POST | `/test_classifier` | Classifier accuracy evaluation |
| GET | `/test_classifier_dataset` | Run built-in test dataset, cache result |
vector_store.py (prefix `/api/v1/vector`)
| Method | Endpoint | Description |
|---|---|---|
| GET | `/` | List all documents (paginated, filterable) |
| GET | `/filenames` | List ingested file records |
| GET | `/{id}` | Get single document by ChromaDB ID |
| POST | `/` | Upload + ingest file |
| PUT | `/{id}` | Update document content/metadata |
| DELETE | `/ids` | Bulk delete by ID list |
| DELETE | `/{id}` | Delete single document |
| DELETE | `/` | Filter-based delete (filename/source/contains) |
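An illustrative upload call against the ingest endpoint; the multipart field name `file` is an assumption, not confirmed from the route signature:

```python
import requests

# Hypothetical multipart upload; adjust the field name to match the route.
with open("data/department_data/computer_eng.json", "rb") as fh:
    resp = requests.post(
        "http://localhost:8000/api/v1/vector/",
        files={"file": ("computer_eng.json", fh, "application/json")},
    )
print(resp.json())  # ingestion record: chunk count, timing, size
```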
Configuration (app/core/config.py)
All settings are read from .env via Pydantic BaseSettings:
```python
class Settings(BaseSettings):
    # Paths
    collection_name: str = "classifier_test_1"
    persist_directory: str = "./data/vector_stores/classifier_test_1"

    # Chunking
    chunk_size: int = 500
    chunk_overlap: int = 100

    # Retrieval
    similarity_top_k: int = 8
    similarity_threshold: float = 0.4

    # LLM Provider
    llm_provider: Literal["gemini", "local"] = "local"
    enable_fallback: bool = True

    # Models
    embedding_model_name: str = "models/gemini-embedding-001"
    gemini_model_name: str = "gemini-2.5-flash-lite"
    local_model_name: str = "EXAONE-3.5-2.4B-Instruct-Q4_K_M.gguf"

    # Generation
    max_output_tokens: int = 2048
    local_max_tokens: int = 512

    # Auth
    google_api_key: str  # required; must be in .env
```
6. Evaluation & Metrics
Retrieval Evaluation (test_queries / POST /api/v1/rag/test)
Tests each (question, expected_document, expected_chunk_index) triple against hybrid_query:
| Metric | Formula | Interpretation |
|---|---|---|
| Hit Rate | `hits / total` | % of questions where the exact chunk was retrieved |
| Top-1 Hit Rate | `rank==1 hits / total` | % of questions where the exact chunk was the top result |
| MRR | `mean(1/rank)` | Mean Reciprocal Rank; higher = correct result ranked earlier |
| Doc Precision | `correct_source_chunks / all_chunks` | How many retrieved chunks came from the right document |
| Doc Recall | `1 if any correct_source_chunk else 0` | Did we retrieve at least one chunk from the right document? |
| Doc Noise | `wrong_source_chunks / all_chunks` | Proportion of off-topic chunks in the result set |
| Error Rate | `1 - hit_rate` | Miss rate for exact chunk retrieval |
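A sketch of how the chunk-level metrics combine, given the 1-based rank of the expected chunk for each question (`None` for a miss); function and variable names are illustrative:

```python
def retrieval_metrics(ranks):
    """ranks: 1-based rank of the expected chunk per question, or None if missed."""
    total = len(ranks)
    hits = [r for r in ranks if r is not None]
    return {
        "hit_rate": len(hits) / total,
        "top1_hit_rate": sum(1 for r in hits if r == 1) / total,
        "mrr": sum(1.0 / r for r in hits) / total,  # misses contribute 0
        "error_rate": 1 - len(hits) / total,
    }

retrieval_metrics([1, 3, None, 2])
# {'hit_rate': 0.75, 'top1_hit_rate': 0.25, 'mrr': 0.458..., 'error_rate': 0.25}
```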
Test Input Schema:
```python
class TestRequestSchema(BaseModel):
    tests: List[Test]   # question + document + chunk_index
    k: int = 5
    threshold: float = 0.4
```
Classifier Evaluation (test_classifier / POST /api/v1/rag/test_classifier)
Evaluates predictions for all 4 classification fields (type, category, topic, intent):
| Metric | Notes |
|---|---|
| Accuracy | sklearn.accuracy_score |
| Precision (macro) | zero_division=0 |
| Recall (macro) | zero_division=0 |
| F1 Macro | Unweighted average across classes |
| F1 Weighted | Class-frequency weighted |
| Classification Report | Full per-class breakdown (output_dict=True) |
A bundled test dataset is stored in app/utils/tests.py as classifier_test_dataset and can be executed via GET /api/v1/rag/test_classifier_dataset. Results are memoized on the RAGService.evaluation dict for the lifetime of the server process.
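A sketch of the per-field metric computation with scikit-learn, matching the table above; it would be evaluated separately for each of the four fields:

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             precision_recall_fscore_support)

def classifier_metrics(y_true, y_pred):
    """Compute the table's metrics for one field (type/category/topic/intent)."""
    p, r, f1_macro, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    _, _, f1_weighted, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted", zero_division=0)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision_macro": p,
        "recall_macro": r,
        "f1_macro": f1_macro,
        "f1_weighted": f1_weighted,
        "report": classification_report(y_true, y_pred,
                                        output_dict=True, zero_division=0),
    }
```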
7. Known Limitations
Technical Debt
- `preprocess_query` is incomplete. The function body contains an LLM-powered query rewriting block that is commented out; currently it just calls `normalize()` (tokenize only), so no stopword removal or lemmatization is applied to user queries (only to stored documents).
- `search_docs` does not honour `filename` as a metadata filter in Chroma. The filter is applied in Python post-retrieval, which is inefficient for large collections.
- Count intent is synthetic. The `"Total <topic>: N"` chunk is auto-generated during ingestion, not taken from the source document. If the source data changes, stale count chunks can remain indexed.
- `VectorStore.get_dict()` has a `print(type(rows))` debug statement left in production code.
- `FileService.__init__` has a stray backtick at the start of its docstring.
Planned but Unimplemented
- Query rewriting via local LLM: the skeleton is commented out in `preprocess_query()`.
- Semantic caching: no query result memoization at the API layer.
- Re-ranker: no cross-encoder re-ranking step; retrieval relies only on RRF + boosting.
- `topic` is not included in the ChromaDB hard filter: only `type` + `intent` are hard-anchored; `category` and `topic` are soft `$or` hints.
Performance Bottlenecks
- Local LLM (LlamaCpp) is CPU-only with `n_ctx=8096` and `n_threads=4`. Response latency is high (~10–30 s) on low-RAM systems.
- The classifier uses SentenceTransformer + TF-IDF features; inference runs on every request with no caching of query embeddings.
- The BM25 corpus is rebuilt from scratch per request: `BM25Retriever.from_documents()` is called inside `_bm25_rank()` each time.
- `app/utils/tests.py`, which holds `classifier_test_dataset`, is a very large file (1.8 MB) loaded at import time.
- The memoized evaluation in `rag_service.evaluation` is not thread-safe if the server runs with multiple workers.
8. File Structure
```
VGEC-RAG-Chatbot/
│
├── app/                          # Application package
│   ├── main.py                   # FastAPI app, router mounting, CORS middleware
│   ├── core/
│   │   ├── config.py             # Pydantic Settings (all tuneable params)
│   │   └── paths.py              # Path constants helper
│   │
│   ├── api/
│   │   ├── dependencies.py       # lru_cache singleton for RAGService
│   │   ├── routes/
│   │   │   ├── rag.py            # /rag endpoints (query, test, classifier)
│   │   │   ├── vector_store.py   # /vector endpoints (CRUD for ChromaDB)
│   │   │   └── settings.py       # /settings endpoints
│   │   └── schemas/
│   │       ├── requests.py       # RAGRequest, PaginationParams, etc.
│   │       └── tests.py          # TestRequestSchema, TestClassifierReqSchema
│   │
│   ├── services/
│   │   ├── rag_service.py        # RAGService (main orchestrator)
│   │   ├── hybrid_retrieval.py   # HybridRetrievalService + RRF logic
│   │   ├── classifier_service.py # Classifier class + singleton clf
│   │   ├── ingestion_service.py  # IngestionService (chunking pipeline)
│   │   ├── file_service.py       # FileService (file I/O + metadata JSON)
│   │   ├── vector_store.py       # VectorStore (thin ChromaDB wrapper)
│   │   ├── text_splitter.py      # TextSplitter (RecursiveCharacter + variants)
│   │   └── document_loader.py    # (legacy loader, not in primary path)
│   │
│   ├── utils/
│   │   ├── preprocessing.py      # preprocess(), normalize(), preprocess_query()
│   │   ├── document_helpers.py   # get_references_v2(), build_metadata(), create_documents()
│   │   ├── model_factory.py      # get_llm_model(), get_embedding_model()
│   │   ├── constants.py          # stopwords list, short_words_mappings
│   │   ├── embeddings.py         # (thin embedding util)
│   │   ├── llm_models.py         # (thin LLM util)
│   │   └── tests.py              # classifier_test_dataset (large, 1.8MB)
│   │
│   └── prompts/
│       └── __init__.py           # SYSTEM_PROMPT, wrap_exaone()
│
├── ml_models/
│   ├── classifier/
│   │   └── chatbot_classifier.pkl  # Pickled pipeline (models, tfidf, label encoders, etc.)
│   ├── embeddings/               # (Local embedding model weights, if any)
│   └── llm/
│       └── EXAONE-3.5-2.4B-*.gguf  # Local LLM weights
│
├── data/
│   ├── department_data/          # Source JSON files per department
│   │   ├── computer_eng.json
│   │   ├── civil.json
│   │   └── ...
│   ├── documents/                # Persistent copies of ingested files
│   ├── vector_stores/
│   │   └── classifier_test_1/    # ChromaDB persist directory
│   ├── classifier_test_1.json    # Ingestion metadata registry (FileService records)
│   └── other_data/               # Misc data files
│
├── temp/                         # Staging area for uploaded files (auto-cleared)
├── scripts/                      # Offline scripts (training, testing)
├── tests/                        # Test files
│
├── requirements.txt              # Pinned production dependencies
├── .env                          # Runtime secrets (google_api_key, etc.)
├── .env.example                  # Template for .env
└── CODEBASE_DOCUMENTATION.md     # This file
```
End of documentation.