VGEC RAG Chatbot: Codebase Documentation
Generated: 2026-03-25
Version: 1.0.0
Scope: Full system (ingestion, retrieval, classification, API, evaluation)
Table of Contents
- Project Overview
- System Architecture
- Schema & Data Model
- Retrieval Pipeline
- Key Classes & Modules
- Evaluation & Metrics
- Known Limitations
- File Structure
1. Project Overview
Purpose
VGEC RAG Chatbot is a Retrieval-Augmented Generation (RAG) chatbot for Vishwakarma Government Engineering College (VGEC), Chandkheda, Gujarat. It allows students, faculty, and visitors to query structured information about the institution (departments, faculty, syllabus, labs, intake capacity, and more) through natural language.
Domain
- Institution: VGEC (Government Engineering College, Gujarat)
- Data Coverage: Department-level information for multiple disciplines (Computer Engineering, Civil, Electrical, IT, ECE, etc.)
- Topics: Faculty lists, lab facilities, syllabus details, HOD info, research activities, intake capacity, achievements
Tech Stack
| Layer | Technology |
|---|---|
| API Framework | FastAPI |
| Vector Database | ChromaDB (persistent, local) |
| Embeddings | Google gemini-embedding-001 (via langchain-google-genai) |
| LLM (Cloud) | Google Gemini gemini-2.5-flash-lite |
| LLM (Local) | EXAONE-3.5-2.4B-Instruct-Q4_K_M.gguf via llama-cpp-python |
| NLP / Preprocessing | spaCy (en_core_web_sm), NLTK (PorterStemmer) |
| Classifier | Scikit-learn LogisticRegression + SentenceTransformer (MongoDB/mdbr-leaf-mt) |
| BM25 | langchain-community BM25Retriever |
| Chunking | LangChain RecursiveCharacterTextSplitter |
| Config | Pydantic BaseSettings (.env-backed) |
Key Features Implemented
- ✅ Structured JSON ingestion with intent-aware chunking
- ✅ Hybrid retrieval: BM25 + vector search fused via Reciprocal Rank Fusion (RRF)
- ✅ Intent/metadata classification with confidence-gated ChromaDB filters
- ✅ Abbreviation expansion (CE → Computer Engineering, etc.)
- ✅ Multi-turn conversation history support
- ✅ Dual LLM backend with automatic fallback (Gemini → Local)
- ✅ Full CRUD REST API for vector store management
- ✅ Offline evaluation endpoint (MRR, hit rate, noise rate)
- ✅ Classifier accuracy evaluation endpoint
2. System Architecture
Component Diagram
```
           ┌───────────────────────────┐
           │        FastAPI App        │
           │   /api/v1/rag   /vector   │
           └─────────────┬─────────────┘
                         │ DI (lru_cache)
           ┌─────────────▼─────────────┐
           │        RAGService         │
           │    (core orchestrator)    │
           └────┬───────────────┬──────┘
                │               │
   ┌────────────▼────┐ ┌────────▼───────────────┐
   │ IngestionService│ │ HybridRetrievalService │
   │  (write path)   │ │      (read path)       │
   └───────┬─────────┘ └───┬────────────┬───────┘
           │               │            │
   ┌───────▼──────┐ ┌──────▼───────┐ ┌──▼─────────────┐
   │ FileService  │ │ ClassifierSvc│ │  VectorStore   │
   │ (file + meta)│ │ (clf predict)│ │   (ChromaDB)   │
   └──────────────┘ └──────────────┘ └────────────────┘
```
Data Flow
Ingestion Path
```
File Upload (PDF/MD/TXT/JSON)
      │
      ▼
FileService.read_file()          → type-aware loading (PyMuPDF for PDF)
      │   returns: Document + metadata
      ▼
FileService.write_file()         → persist copy to data/documents/
      │
      ▼
IngestionService.handle_*_docs() → route by file extension
      │
      ├─ JSON → handle_json_docs() → intent-aware chunks (list / detail / count)
      └─ text → handle_text_docs() → RecursiveCharacterTextSplitter + normalize()
      │
      ▼
VectorStore.add_documents()      → embed + upsert into ChromaDB
      │
      ▼
FileService.patch_metadata()     → update ingestion record JSON (chunk count, timing, size)
```
Query Path
```
User Question
      │
      ▼
preprocess_query() → normalize() (tokenize only; see Known Limitations)
      │
      ▼
HybridRetrievalService.retrieve()
      │
      ├─ clf.expand_abbreviations()  → CE → Computer Engineering
      ├─ clf.predict_with_filter()   → LogReg predict → Chroma $and/$or filter
      ├─ _vector_rank()              → ChromaDB similarity_search_with_score (k=15)
      ├─ _bm25_rank()                → BM25 over the vector candidate pool
      ├─ _reciprocal_rank_fusion()   → weighted RRF merge
      ├─ metadata score boosting     → multiply fused scores for confident matches
      └─ _apply_title_boost()        → per-query-word title match bonus
      │
      ▼
get_references_v2() → filter by threshold, build context string
      │
      ▼
LLM.invoke(prompt) → Gemini or local LlamaCpp
      │
      ▼
Return: { answer, references, context, threshold_used, k_used }
```
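For illustration, a minimal client call against the primary endpoint. The request field names (`question`) are an assumption inferred from the response shape above, not confirmed from the `RAGRequest` schema:

```python
import requests

# Hypothetical request body; field names are assumed, not taken from RAGRequest.
resp = requests.post(
    "http://localhost:8000/api/v1/rag/hybrid_query",
    json={"question": "Who is the HOD of the CE department?"},
    timeout=120,  # local LlamaCpp responses can take tens of seconds
)
body = resp.json()
print(body["answer"])      # LLM-generated answer
print(body["references"])  # chunks that passed the score threshold
```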
External Dependencies
| Dependency | Role | Provider |
|---|---|---|
| ChromaDB | Persistent vector store | Local disk |
| Google Gemini API | Embeddings + LLM generation | Google Cloud |
| LlamaCpp (GGUF model) | Local LLM fallback | Local CPU |
| Sentence Transformers | Classifier feature extraction | HuggingFace Hub |
| spaCy en_core_web_sm | POS tagging / lemmatization | Local |
3. Schema & Data Model
Source JSON Format
Source data files (e.g. computer_eng.json) follow this schema:
```json
{
  "id": "computer-engineering-department",
  "name": "Computer Engineering Department",
  "source": "https://www.vgecg.ac.in/department.php?dept=3",
  "category": "computer_eng",
  "type": "department",
  "created_date": "2026-02-19",
  "content": {
    "<topic_key>": {
      "list": ["item 1", "item 2", "..."],
      "details": "Paragraph describing the topic."
    }
  }
}
```
Top-level fields:
| Field | Type | Description |
|---|---|---|
| `id` | string | Unique document identifier |
| `name` | string | Human-readable institution/department name |
| `source` | string | Authoritative URL |
| `category` | string | Department slug (e.g. `computer_eng`) |
| `type` | string | Document type (e.g. `department`) |
| `created_date` | string (ISO) | Data creation date |
| `content` | object | Topic map; each key = a topic |
Chunk Metadata Schema (stored in ChromaDB)
Every vector chunk stored in Chroma carries the following metadata:
| Field | Type | Source |
|---|---|---|
| `id` | string (UUID) | Auto-generated |
| `title` | string | Document name / topic key |
| `source` | string | Source URL |
| `source_file` | string | Filename (e.g. `computer_eng.json`) |
| `type` | string | Taxonomy level 1 (e.g. `department`) |
| `category` | string | Taxonomy level 2 (e.g. `computer_eng`) |
| `topic` | string | Taxonomy level 3 (e.g. `faculty`) |
| `intent` | string | Chunk intent: `list`, `detail`, or `count` |
| `chunk_index` | int | Sequential index within file |
| `created_date` | string (ISO) | Ingestion timestamp |
| `updated_at` | string (ISO) | Last modification timestamp |
| `ext` | string | Source file extension (`json`, `pdf`, `md`, `txt`) |
Hierarchical Taxonomy
The classifier predicts and ChromaDB filters operate on a 3-level hierarchy:
```
type
└── category
    └── topic
        └── intent (list | detail | count)
```
Example mapping (Computer Engineering):
```
type: "department"
└── category: "computer_eng"
    ├── topic: "faculty"      → intent: list | detail
    ├── topic: "lab"          → intent: list | detail
    ├── topic: "syllabus"     → intent: list | detail
    ├── topic: "hod"          → intent: list | detail
    ├── topic: "intake"       → intent: list | detail
    ├── topic: "research"     → intent: list | detail
    └── topic: "achievements"
```
Document Chunking Strategy
JSON documents use a hand-crafted, intent-aware strategy in IngestionService.handle_json_docs():
| Intent | Chunk Content | Metadata |
|---|---|---|
| `list` | Numbered list: `1. item\n2. item\n...` | `intent=list` |
| `count` | `"Total <topic>: N"` (auto-generated) | `intent=count` |
| `detail` | Raw paragraph text | `intent=detail` |
Text/PDF/Markdown documents use RecursiveCharacterTextSplitter (a minimal sketch follows this list):
- Default: `chunk_size=500`, `chunk_overlap=100`
- Separator priority: `\n\n` → `\n` → `" "` (character fallback)
- Markdown variant respects `---` section delimiters
- Content is passed through `normalize()` (tokenize + strip blanks) before storage
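To make the JSON strategy concrete, here is a minimal sketch of intent-aware chunking under the schema from section 3. `chunk_department_json` is an illustrative name, not the actual `handle_json_docs()` body:

```python
from langchain_core.documents import Document

def chunk_department_json(data: dict) -> list[Document]:
    """Illustrative intent-aware chunking for the department JSON schema."""
    docs = []
    base = {"title": data["name"], "source": data["source"],
            "type": data["type"], "category": data["category"]}
    for topic, body in data["content"].items():
        meta = {**base, "topic": topic}
        items = body.get("list") or []
        if items:
            # intent=list: one numbered-list chunk per topic
            numbered = "\n".join(f"{i}. {item}" for i, item in enumerate(items, 1))
            docs.append(Document(page_content=numbered,
                                 metadata={**meta, "intent": "list"}))
            # intent=count: synthetic "Total <topic>: N" chunk
            docs.append(Document(page_content=f"Total {topic}: {len(items)}",
                                 metadata={**meta, "intent": "count"}))
        if body.get("details"):
            # intent=detail: raw paragraph chunk
            docs.append(Document(page_content=body["details"],
                                 metadata={**meta, "intent": "detail"}))
    return docs
```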
4. Retrieval Pipeline
Query Processing Flow
```python
# Step 1: Normalize input
question = preprocess_query(question)
# → currently just normalize() (tokenize); the planned spaCy POS filter
#   (NOUN, PROPN, VERB, NUM, ADJ) + lemmatize + stopword step is commented out

# Step 2: Expand abbreviations
processed_query = clf.expand_abbreviations(query)
# → "CE dept" → "computer engineering department"

# Step 3: Classify intent/metadata
filters = clf.predict_with_filter([processed_query])
# → {"$and": [{"type": "department"}, {"intent": "list"}, {"$or": [...]}]}

# Step 4: Vector search with optional filter
raw_results = chroma.similarity_search_with_score(query, k=15, filter=filters)
# Fallback: if filtered results are empty, retry without the filter

# Step 5: BM25 re-rank over vector candidates
bm25_results = BM25Retriever.from_documents(candidate_docs)

# Step 6: RRF fusion
# fused_score(d) = bm25_weight   * 1/(rrf_k + rank_bm25)
#                + vector_weight * 1/(rrf_k + rank_vec)

# Step 7: Metadata confidence boosting
if doc.metadata[field] == predicted_val and conf > 0.90:
    result.fused_score *= boost_factor  # 1.10–1.20

# Step 8: Title word boost
for word in query_words:
    if word in doc.title:
        result.fused_score += title_boost_per_word  # 0.004

# Step 9: Threshold filter + sort + top-k
results = [r for r in results if r.fused_score >= threshold]
```
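The Step 6 formula translates directly into code. Below is a minimal standalone sketch of the weighted RRF merge (names are illustrative; the defaults mirror the `hybrid_query` column of the config table below):

```python
def reciprocal_rank_fusion(bm25_ranked, vector_ranked,
                           bm25_weight=0.45, vector_weight=0.55, rrf_k=20):
    """Weighted RRF over two ranked lists of document IDs (1-based ranks)."""
    fused = {}
    for rank, doc_id in enumerate(bm25_ranked, start=1):
        fused[doc_id] = fused.get(doc_id, 0.0) + bm25_weight / (rrf_k + rank)
    for rank, doc_id in enumerate(vector_ranked, start=1):
        fused[doc_id] = fused.get(doc_id, 0.0) + vector_weight / (rrf_k + rank)
    # Highest fused score first
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```

Documents appearing in both lists accumulate contributions from each, which is what lets RRF reward agreement between the lexical and semantic rankers.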
Classifier Thresholds
The Classifier uses two separate threshold tables:
Prediction threshold: below this, the field is set to `None` (not used at all):
| Field | Threshold |
|---|---|
| type | 0.40 |
| category | 0.40 |
| topic | 0.50 |
| intent | 0.60 |
Filter threshold: above this, the field becomes a hard ChromaDB `$and` filter:
| Field | Threshold |
|---|---|
| type | 0.65 |
| category | 0.65 |
| topic | 0.70 |
Filter Construction Logic (_build_filter)
```python
# Gate: if type confidence < 0.65 → return None (full scan)
# Hard anchors (always included if type passes):
#   - type   == predicted_type
#   - intent == predicted_intent (special: "count" expands to count OR detail)
# Soft hints (combined as $or):
#   - category == predicted_category (if conf >= 0.65, else "general")
#   - topic    == predicted_topic    (if conf >= 0.70, else "general")
```
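A sketch of this gating logic, assuming the prediction dict shape returned by `predict()` (`{field, field_conf}`). It illustrates the rules above and is not the actual `_build_filter` body:

```python
def build_filter(pred: dict):
    """Confidence-gated ChromaDB filter construction (illustrative)."""
    # Gate: low type confidence -> no filter, full scan
    if pred["type_conf"] < 0.65:
        return None
    clauses = [{"type": pred["type"]}]
    # "count" intent also matches detail chunks
    if pred["intent"] == "count":
        clauses.append({"intent": {"$in": ["count", "detail"]}})
    else:
        clauses.append({"intent": pred["intent"]})
    # Soft hints: fall back to "general" below their filter thresholds
    category = pred["category"] if pred["category_conf"] >= 0.65 else "general"
    topic = pred["topic"] if pred["topic_conf"] >= 0.70 else "general"
    clauses.append({"$or": [{"category": category}, {"topic": topic}]})
    return {"$and": clauses}
```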
Hybrid Retrieval Config (Defaults)
| Parameter | hybrid_query | search_docs |
|---|---|---|
| candidate_k | 15 | 15 |
| top_k (final) | `settings.similarity_top_k` (8) | `k` (param) |
| bm25_weight | 0.45 | 0.70 |
| vector_weight | 0.55 | 0.30 |
| rrf_k | 20 | 20 |
| bm25_k1 | 1.2 | 1.5 |
| bm25_b | 0.9 | 0.75 |
| title_boost_per_word | 0.004 | 0.004 |
| score_threshold | 0.4 | 0.4 |
Note: `search_docs` is BM25-heavy (0.70) since it is used for keyword-oriented document browsing, while `hybrid_query` is vector-heavy for semantic QA.
5. Key Classes & Modules
Services (app/services/)
RAGService
Main orchestrator. Singleton via lru_cache in dependencies.py.
| Method | Description |
|---|---|
| `query()` | Semantic-only QA (vector search → LLM) |
| `hybrid_query()` | Hybrid QA (BM25 + vector → RRF → LLM) |
| `search_docs()` | BM25-heavy document search, no LLM |
| `ingest_documents()` | Ingest a file path into the vector store |
| `get_filenames()` | Return all tracked file metadata records |
| `test_queries()` | Batch retrieval evaluation (MRR, precision, noise) |
| `test_classifier()` | Batch classifier accuracy evaluation |
| `delete_database()` | Drop the entire ChromaDB collection |
HybridRetrievalService
Stateless per-request service created inline by RAGService.
| Method | Description |
|---|---|
| `retrieve(query)` | Full hybrid retrieval pipeline; returns `List[RetrievalResult]` |
| `_vector_rank()` | Chroma similarity search + classifier filter |
| `_bm25_rank()` | BM25 over candidate pool |
| `_reciprocal_rank_fusion()` | Merge both ranked lists via RRF |
| `_apply_title_boost()` | Word-level title match score bonus |
RetrievalResult dataclass:
```python
@dataclass
class RetrievalResult:
    document: Document
    fused_score: float
    bm25_rank: Optional[int]
    vector_rank: Optional[int]
    title_boost: float
```
Classifier
Loaded at startup from a pickled pipeline (chatbot_classifier.pkl).
| Method | Description |
|---|---|
| `predict(queries)` | Returns list of `{type, category, topic, intent, *_conf}` dicts |
| `predict_with_filter(queries)` | Returns a ChromaDB-compatible filter dict or `None` |
| `expand_abbreviations(text)` | Regex-based abbreviation expansion |
| `get_features(queries)` | Build a concatenated SentenceTransformer-embedding + TF-IDF feature matrix |
| `train_models(df)` | Train 4 LogisticRegression classifiers (offline use) |
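As a sketch of the feature construction, assuming the dense encoder is the `MongoDB/mdbr-leaf-mt` model from the tech stack table and `tfidf` is the fitted vectorizer stored in the pickled pipeline:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("MongoDB/mdbr-leaf-mt")

def get_features(queries, tfidf):
    """Dense embedding ++ TF-IDF concatenation (tfidf: fitted TfidfVectorizer)."""
    dense = encoder.encode(queries)              # (n, d) dense embeddings
    sparse = tfidf.transform(queries).toarray()  # (n, v) TF-IDF, densified
    return np.hstack([dense, sparse])            # combined feature matrix
```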
IngestionService
| Method | Description |
|---|---|
| `ingest(file_path)` | Load + chunk a file; returns `List[Document]` |
| `handle_json_docs()` | Intent-aware chunking for structured JSON data |
| `handle_text_docs()` | Recursive character splitting for unstructured text |
| `get_records()` | Delegate to `FileService.get_records()` |
| `delete_record(filename)` | Remove a file's metadata record |
| `path_record(path, metadata)` | Patch ingestion stats after indexing |
FileService
| Method | Description |
|---|---|
| `read_file(path)` | Load file content; dispatches by extension |
| `write_file(path, content, metadata)` | Persist file to `data/documents/` |
| `patch_metadata(path, metadata)` | Merge new fields into the existing record |
| `get_records()` | Return all ingestion records as a dict |
| `delete_record(filename)` | Remove a record from `<collection>.json` |
VectorStore
Thin wrapper around langchain_chroma.Chroma.
| Method | Description |
|---|---|
| `get()` | Retrieve all documents |
| `get_by_id(ids)` | Retrieve specific documents by ID |
| `add_documents(docs)` | Embed + insert, skipping empty chunks |
| `update_document(id, doc)` | Delete then re-insert with the same ID |
| `delete(ids)` | Remove documents by ID list |
| `similarity_search_with_score()` | Wrapped Chroma search |
Utilities (app/utils/)
preprocessing.py
| Function | Description |
|---|---|
| `preprocess(text)` | spaCy POS filter + lemmatize + stopword removal → joined string |
| `normalize(text)` | Tokenize + strip blanks (lightweight, no POS) |
| `preprocess_query(query)` | Applies `normalize()` to user queries |
| `preprocess_documents(docs)` | Applies `preprocess()` to a document list in-place |
| `preprocess_filename(path)` | Sanitize filename (remove special chars, lowercase) |
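A minimal sketch of `preprocess()` as described (POS filter over NOUN, PROPN, VERB, NUM, ADJ; lemmatize; drop stopwords). The real implementation's token filtering may differ in detail:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
KEEP_POS = {"NOUN", "PROPN", "VERB", "NUM", "ADJ"}

def preprocess(text: str) -> str:
    """Keep content-bearing POS tags, lemmatize, drop stopwords."""
    doc = nlp(text)
    return " ".join(tok.lemma_.lower() for tok in doc
                    if tok.pos_ in KEEP_POS and not tok.is_stop)
```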
document_helpers.py
| Function | Description |
|---|---|
| `get_references_v2(docs, threshold)` | Convert `RetrievalResult` list → references dict + context string |
| `get_references(docs, threshold)` | Same for raw `(Document, distance)` tuples (used by `query()`) |
| `build_metadata(path)` | Parse YAML frontmatter from `.md`/`.txt` files |
| `create_documents(chunks, ...)` | Attach standard metadata (UUID, timestamps, indices) to chunks |
| `create_documents_from_text(text)` | Full pipeline: frontmatter parse → split → metadata attach |
| `clean_metadata(metadata)` | Serialize datetime, coerce non-allowed types to string |
model_factory.py
| Function | Description |
|---|---|
| `get_embedding_model()` | Returns `GoogleGenerativeAIEmbeddings` |
| `get_gemini_model()` | Returns `ChatGoogleGenerativeAI` |
| `get_local_model()` | Returns `ChatLlamaCpp` (GGUF, CPU inference) |
| `get_llm_model(provider)` | Dispatches to Gemini or Local with fallback logic |
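A sketch of the fallback dispatch; the real `get_llm_model()` likely gates on `settings.enable_fallback` and catches narrower exceptions than shown here:

```python
def get_llm_model(provider: str = "gemini"):
    """Dispatch with fallback: try the requested backend, swap on failure."""
    try:
        if provider == "gemini":
            return get_gemini_model()  # ChatGoogleGenerativeAI
        return get_local_model()       # ChatLlamaCpp (GGUF, CPU)
    except Exception:
        # Fallback: use the alternate backend if initialization fails
        return get_local_model() if provider == "gemini" else get_gemini_model()
```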
API Routes (app/api/routes/)
rag.py (prefix `/api/v1/rag`)
| Method | Endpoint | Description |
|---|---|---|
| GET | `/` | Health check |
| POST | `/` | Semantic query |
| POST | `/hybrid_query` | Hybrid RAG query (primary endpoint) |
| POST | `/similarity_search` | Hybrid retrieval, no LLM response |
| POST | `/search` | BM25-heavy document search |
| POST | `/test` | Batch retrieval evaluation |
| POST | `/test_classifier` | Classifier accuracy evaluation |
| GET | `/test_classifier_dataset` | Run built-in test dataset, cache result |
vector_store.py (prefix `/api/v1/vector`)
| Method | Endpoint | Description |
|---|---|---|
| GET | `/` | List all documents (paginated, filterable) |
| GET | `/filenames` | List ingested file records |
| GET | `/{id}` | Get single document by ChromaDB ID |
| POST | `/` | Upload + ingest file |
| PUT | `/{id}` | Update document content/metadata |
| DELETE | `/ids` | Bulk delete by ID list |
| DELETE | `/{id}` | Delete single document |
| DELETE | `/` | Filter-based delete (filename/source/contains) |
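An illustrative upload call against the ingest endpoint; the multipart field name `file` is an assumption, not confirmed from the route signature:

```python
import requests

# Hypothetical multipart upload; adjust the field name to match the route.
with open("data/department_data/computer_eng.json", "rb") as fh:
    resp = requests.post(
        "http://localhost:8000/api/v1/vector/",
        files={"file": ("computer_eng.json", fh, "application/json")},
    )
print(resp.json())  # ingestion record: chunk count, timing, size
```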
Configuration (app/core/config.py)
All settings are read from .env via Pydantic BaseSettings:
```python
class Settings(BaseSettings):
    # Paths
    collection_name: str = "classifier_test_1"
    persist_directory: str = "./data/vector_stores/classifier_test_1"

    # Chunking
    chunk_size: int = 500
    chunk_overlap: int = 100

    # Retrieval
    similarity_top_k: int = 8
    similarity_threshold: float = 0.4

    # LLM Provider
    llm_provider: Literal["gemini", "local"] = "local"
    enable_fallback: bool = True

    # Models
    embedding_model_name: str = "models/gemini-embedding-001"
    gemini_model_name: str = "gemini-2.5-flash-lite"
    local_model_name: str = "EXAONE-3.5-2.4B-Instruct-Q4_K_M.gguf"

    # Generation
    max_output_tokens: int = 2048
    local_max_tokens: int = 512

    # Auth
    google_api_key: str  # required; must be in .env
```
6. Evaluation & Metrics
Retrieval Evaluation (test_queries / POST /api/v1/rag/test)
Tests each (question, expected_document, expected_chunk_index) triple against hybrid_query:
| Metric | Formula | Interpretation |
|---|---|---|
| Hit Rate | `hits / total` | % of questions where the exact chunk was retrieved |
| Top-1 Hit Rate | `rank==1 hits / total` | % of questions where the exact chunk was the top result |
| MRR | `mean(1/rank)` | Mean Reciprocal Rank; higher = correct result ranked earlier |
| Doc Precision | `correct_source_chunks / all_chunks` | How many retrieved chunks came from the right document |
| Doc Recall | `1 if any correct_source_chunk else 0` | Did we retrieve at least one chunk from the right document? |
| Doc Noise | `wrong_source_chunks / all_chunks` | Proportion of off-topic chunks in the result set |
| Error Rate | `1 - hit_rate` | Miss rate for exact chunk retrieval |
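A sketch of how the chunk-level metrics combine, given the 1-based rank of the expected chunk for each question (`None` for a miss); function and variable names are illustrative:

```python
def retrieval_metrics(ranks):
    """ranks: 1-based rank of the expected chunk per question, or None if missed."""
    total = len(ranks)
    hits = [r for r in ranks if r is not None]
    return {
        "hit_rate": len(hits) / total,
        "top1_hit_rate": sum(1 for r in hits if r == 1) / total,
        "mrr": sum(1.0 / r for r in hits) / total,  # misses contribute 0
        "error_rate": 1 - len(hits) / total,
    }

retrieval_metrics([1, 3, None, 2])
# {'hit_rate': 0.75, 'top1_hit_rate': 0.25, 'mrr': 0.458..., 'error_rate': 0.25}
```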
Test Input Schema:
```python
class TestRequestSchema(BaseModel):
    tests: List[Test]   # question + document + chunk_index
    k: int = 5
    threshold: float = 0.4
```
Classifier Evaluation (test_classifier / POST /api/v1/rag/test_classifier)
Evaluates predictions for all 4 classification fields (type, category, topic, intent):
| Metric | Notes |
|---|---|
| Accuracy | sklearn.accuracy_score |
| Precision (macro) | zero_division=0 |
| Recall (macro) | zero_division=0 |
| F1 Macro | Unweighted average across classes |
| F1 Weighted | Class-frequency weighted |
| Classification Report | Full per-class breakdown (output_dict=True) |
A bundled test dataset is stored in app/utils/tests.py as classifier_test_dataset and can be executed via GET /api/v1/rag/test_classifier_dataset. Results are memoized on the RAGService.evaluation dict for the lifetime of the server process.
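A sketch of the per-field metric computation with scikit-learn, matching the table above; it would be evaluated separately for each of the four fields:

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             precision_recall_fscore_support)

def classifier_metrics(y_true, y_pred):
    """Compute the table's metrics for one field (type/category/topic/intent)."""
    p, r, f1_macro, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    _, _, f1_weighted, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted", zero_division=0)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision_macro": p,
        "recall_macro": r,
        "f1_macro": f1_macro,
        "f1_weighted": f1_weighted,
        "report": classification_report(y_true, y_pred,
                                        output_dict=True, zero_division=0),
    }
```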
7. Known Limitations
Technical Debt
- `preprocess_query` is incomplete. The function body contains an LLM-powered query rewriting block that is commented out; currently it just calls `normalize()` (tokenize only), so no stopword removal or lemmatization is applied to user queries (only to stored documents).
- `search_docs` does not honour `filename` as a metadata filter in Chroma. The filter is applied in Python post-retrieval, which is inefficient for large collections.
- Count intent is synthetic. The `"Total <topic>: N"` chunk is auto-generated during ingestion, not taken from the source document. If the source data changes, stale count chunks can remain indexed.
- `VectorStore.get_dict()` has a `print(type(rows))` debug statement left in production code.
- `FileService.__init__` has a stray backtick at the start of its docstring.
Planned but Unimplemented
- Query rewriting via local LLM: the skeleton is commented out in `preprocess_query()`.
- Semantic caching: no query result memoization at the API layer.
- Re-ranker: no cross-encoder re-ranking step; retrieval relies only on RRF + boosting.
- `topic` is not included in the ChromaDB hard filter: only `type` + `intent` are hard-anchored; `category` and `topic` are soft `$or` hints.
Performance Bottlenecks
- Local LLM (LlamaCpp) is CPU-only with `n_ctx=8096` and `n_threads=4`. Response latency is high (~10–30 s) on low-RAM systems.
- The classifier uses SentenceTransformer + TF-IDF features; inference runs on every request with no caching of query embeddings.
- The BM25 corpus is rebuilt from scratch per request: `BM25Retriever.from_documents()` is called inside `_bm25_rank()` each time.
- `app/utils/tests.py`, which holds `classifier_test_dataset`, is a very large file (1.8 MB) loaded at import time.
- The memoized evaluation in `rag_service.evaluation` is not thread-safe if the server runs with multiple workers.
8. File Structure
```
VGEC-RAG-Chatbot/
│
├── app/                          # Application package
│   ├── main.py                   # FastAPI app, router mounting, CORS middleware
│   ├── core/
│   │   ├── config.py             # Pydantic Settings (all tuneable params)
│   │   └── paths.py              # Path constants helper
│   │
│   ├── api/
│   │   ├── dependencies.py       # lru_cache singleton for RAGService
│   │   ├── routes/
│   │   │   ├── rag.py            # /rag endpoints (query, test, classifier)
│   │   │   ├── vector_store.py   # /vector endpoints (CRUD for ChromaDB)
│   │   │   └── settings.py       # /settings endpoints
│   │   └── schemas/
│   │       ├── requests.py       # RAGRequest, PaginationParams, etc.
│   │       └── tests.py          # TestRequestSchema, TestClassifierReqSchema
│   │
│   ├── services/
│   │   ├── rag_service.py        # RAGService (main orchestrator)
│   │   ├── hybrid_retrieval.py   # HybridRetrievalService + RRF logic
│   │   ├── classifier_service.py # Classifier class + singleton clf
│   │   ├── ingestion_service.py  # IngestionService (chunking pipeline)
│   │   ├── file_service.py       # FileService (file I/O + metadata JSON)
│   │   ├── vector_store.py       # VectorStore (thin ChromaDB wrapper)
│   │   ├── text_splitter.py      # TextSplitter (RecursiveCharacter + variants)
│   │   └── document_loader.py    # (legacy loader, not in primary path)
│   │
│   ├── utils/
│   │   ├── preprocessing.py      # preprocess(), normalize(), preprocess_query()
│   │   ├── document_helpers.py   # get_references_v2(), build_metadata(), create_documents()
│   │   ├── model_factory.py      # get_llm_model(), get_embedding_model()
│   │   ├── constants.py          # stopwords list, short_words_mappings
│   │   ├── embeddings.py         # (thin embedding util)
│   │   ├── llm_models.py         # (thin LLM util)
│   │   └── tests.py              # classifier_test_dataset (large, 1.8MB)
│   │
│   └── prompts/
│       └── __init__.py           # SYSTEM_PROMPT, wrap_exaone()
│
├── ml_models/
│   ├── classifier/
│   │   └── chatbot_classifier.pkl  # Pickled pipeline (models, tfidf, label encoders, etc.)
│   ├── embeddings/               # (Local embedding model weights, if any)
│   └── llm/
│       └── EXAONE-3.5-2.4B-*.gguf  # Local LLM weights
│
├── data/
│   ├── department_data/          # Source JSON files per department
│   │   ├── computer_eng.json
│   │   ├── civil.json
│   │   └── ...
│   ├── documents/                # Persistent copies of ingested files
│   ├── vector_stores/
│   │   └── classifier_test_1/    # ChromaDB persist directory
│   ├── classifier_test_1.json    # Ingestion metadata registry (FileService records)
│   └── other_data/               # Misc data files
│
├── temp/                         # Staging area for uploaded files (auto-cleared)
├── scripts/                      # Offline scripts (training, testing)
├── tests/                        # Test files
│
├── requirements.txt              # Pinned production dependencies
├── .env                          # Runtime secrets (google_api_key, etc.)
├── .env.example                  # Template for .env
└── CODEBASE_DOCUMENTATION.md     # This file
```
End of documentation.