metadata
title: KASITBot
emoji: 🎓
colorFrom: green
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
license: mit

KASITBot 🎓

Intelligent Academic Assistant for the King Abdullah II School of Information Technology (KASIT), University of Jordan

A bilingual (Arabic / English) RAG-powered chatbot that answers student questions about courses, exams, faculty, graduation requirements, and research labs, with every answer grounded in real faculty documents.


How It Works

KASITBot uses a Hybrid RAG pipeline to find answers:

  1. Vector Search: embeds the query with text-embedding-3-large and retrieves the 20 most semantically similar document chunks from FAISS
  2. BM25 Keyword Search: finds exact keyword matches (course codes, professor names, room numbers, Arabic proper nouns)
  3. RRF Fusion: merges both result lists into a single ranking using Reciprocal Rank Fusion
  4. Cross-Encoder Re-Ranking: scores each candidate jointly with the query and keeps the best 5
  5. GPT-4o Answer Generation: only the top 5 re-ranked chunks are sent to the model, keeping the context short and relevant

User Question
     │
     ├──► Vector Search (FAISS)  ─────┐
     │                                ├──► RRF Fusion ──► Re-Ranker ──► GPT-4o ──► Answer
     └──► BM25 Keyword Search ────────┘
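
The fusion step can be sketched in a few lines. This is a minimal illustration of Reciprocal Rank Fusion, not the code in app.py: the function name rrf_fuse, the constant k=60 (a commonly used default), and the chunk ids are all assumptions.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of chunk ids with Reciprocal Rank Fusion.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k=60 is a common default and an assumption
    here, not a value confirmed by this project.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk id for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


vector_hits = [12, 7, 42, 3]  # hypothetical chunk ids from the vector search
bm25_hits = [42, 12, 99]      # hypothetical chunk ids from the BM25 search
fused = rrf_fuse([vector_hits, bm25_hits])
```

Chunks that appear in both lists (12 and 42 here) rise to the top because their reciprocal-rank contributions add up, which is why fusion needs only ranks and not comparable scores from the two retrievers.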

Project Structure

kasitbot/
├── app.py                          # Flask web server + RAG chat logic
├── rag_preprocessor.py             # Step 1: Extract & chunk PDF/DOCX documents
├── embedding_generator.py          # Step 2: Embed chunks with OpenAI
├── vector_store.py                 # Step 3: Build & save FAISS index
├── faiss_index/
│   ├── index.faiss                 # Binary vector index (generated)
│   └── metadata.json               # Chunk text + source info (generated)
├── input_documents/                # Place your PDF and DOCX files here
├── rag_dataset.json                # Preprocessor output (generated)
├── rag_dataset_with_embeddings.json  # Embeddings output (generated)
├── .env                            # Your API key (never commit this)
├── .env.example                    # Template for .env
└── requirements.txt

Setup & Installation

Prerequisites

  • Python 3.10+
  • An OpenAI API key

1. Clone and install dependencies

git clone <your-repo-url>
cd kasitbot
pip install -r requirements.txt

2. Configure your API key

cp .env.example .env

Open .env and replace the placeholder:

OPENAI_API_KEY=sk-your-real-key-here

⚠️ Never commit your .env file. Add it to .gitignore.

3. Add your documents

Place your faculty PDF and DOCX files inside input_documents/:

input_documents/
├── kasit_handbook.pdf
├── course_catalog.docx
├── faculty_schedules.pdf
└── exam_timetable.docx

Both Arabic and English documents are supported natively; no translation is needed.


Building the Index

Run these three steps in order whenever you add or update documents:

# Step 1: Extract and chunk documents
python rag_preprocessor.py

# Step 2: Generate embeddings with text-embedding-3-large
python embedding_generator.py

# Step 3: Build and save the FAISS vector index
python vector_store.py

You only need to re-run these when your source documents change. The index is saved to disk and loaded automatically when the app starts.


Running the App

python app.py

Open your browser at: http://localhost:5000


API Reference

POST /api/chat

Send a conversation and receive an answer.

Request body:

{
  "messages": [
    { "role": "user", "content": "What are the graduation credit requirements?" }
  ],
  "lang": "en"
}

| Field | Type | Description |
|---|---|---|
| messages | array | Full conversation history in ChatGPT format |
| lang | string | "en" for English, "ar" for Arabic |

Response:

{
  "answer": "To graduate from KASIT, students must complete...",
  "sources": [
    { "rank": 1, "source": "kasit_handbook.pdf", "chunk_id": 42, "score": 0.91 }
  ],
  "retrieval": "hybrid"
}
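
As an illustration of calling this endpoint, the sketch below builds the request with the Python standard library only. The helper name build_chat_request is hypothetical, and the base URL assumes the local address given under Running the App.

```python
import json
import urllib.request


def build_chat_request(question, lang="en", base_url="http://localhost:5000"):
    """Build (but do not send) a POST request for /api/chat.

    The payload mirrors the request body documented above: a messages
    array in chat format plus a lang field.
    """
    payload = {
        "messages": [{"role": "user", "content": question}],
        "lang": lang,
    }
    return urllib.request.Request(
        url=f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request("What are the graduation credit requirements?")

# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     answer = json.load(response)["answer"]
```

For a multi-turn conversation, the messages list would carry the earlier user and assistant turns as well, since the endpoint expects the full history.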

GET /api/health

Returns the status of all components.

{
  "status": "ok",
  "rag_available": true,
  "bm25_available": true,
  "reranker_available": true,
  "model": "gpt-4o",
  "embedding_model": "text-embedding-3-large",
  "retrieval_mode": "hybrid (vector + BM25 + cross-encoder)"
}
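
A monitoring script might read the availability flags from this payload. The sketch below only inspects the three boolean fields shown above; the function name is hypothetical, and how the app behaves when a component is missing is not stated here.

```python
def unavailable_components(health):
    """Return the availability flags that the health payload reports as false.

    A flag that is absent from the payload is treated as unavailable,
    which is an assumption of this sketch.
    """
    flags = ("rag_available", "bm25_available", "reranker_available")
    return [name for name in flags if not health.get(name, False)]


# Hypothetical payload in which the BM25 index failed to load.
sample = {
    "status": "ok",
    "rag_available": True,
    "bm25_available": False,
    "reranker_available": True,
}
missing = unavailable_components(sample)
```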

Key Design Decisions

Why no Arabic translation?

Earlier versions translated Arabic document chunks to English before embedding them. This caused two problems: translation errors corrupted the content, and Arabic queries no longer matched the translated embeddings. text-embedding-3-large handles Arabic natively and accurately, so translation is unnecessary.

Why hybrid search?

Pure vector search can miss exact lookups. If a student asks "when is the CS401 exam?" or "what does Dr. Omar teach?", vector similarity may rank the right chunk poorly because embeddings capture overall meaning rather than exact tokens such as course codes and names. BM25 catches these keyword matches, so the two approaches complement each other.
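
To make the keyword side concrete, here is a small standalone Okapi BM25 scorer. It is a sketch of one common BM25 variant written from scratch, not the rank_bm25 library the project depends on; the parameters k1 and b, the whitespace tokenization, and the sample chunks are assumptions.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs_tokens for term in set(doc))
    scores = []
    for doc in docs_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue  # a term absent from the document contributes nothing
            df = doc_freq[term]
            # Non-negative idf variant: rarer terms weigh more.
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            tf = term_freq[term]
            # Length normalization penalizes longer-than-average documents.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical chunks; real ones come from the indexed documents.
chunks = [
    "the cs401 final exam is held in room 201",
    "students must complete the graduation project",
    "exam rules apply to every course",
]
scores = bm25_scores("cs401 exam".split(), [c.split() for c in chunks])
best = max(range(len(chunks)), key=scores.__getitem__)
```

The chunk containing the literal token cs401 wins because that term is rare across the collection, which is exactly the kind of match an embedding may blur.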

Why a re-ranker?

Sending 20 chunks to GPT is noisy: the model can lose the answer in the middle of irrelevant text (the "lost in the middle" problem). The cross-encoder re-ranker reads each candidate together with the query and scores the pair jointly, producing a much more accurate ranking. Only the top 5 go to GPT.
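
The re-ranking step reduces to scoring (query, chunk) pairs and keeping the best few. The sketch below takes the scorer as a parameter so it runs without downloading a model; the names rerank and toy_overlap_scorer and the sample candidates are hypothetical, and the toy scorer is only a stand-in for a real cross-encoder.

```python
def rerank(query, candidates, score_pairs, top_k=5):
    """Keep the top_k candidates by a pairwise (query, chunk) relevance score.

    score_pairs takes a list of (query, chunk) tuples and returns one
    score per pair, in the same order.
    """
    pairs = [(query, chunk) for chunk in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


def toy_overlap_scorer(pairs):
    """Stand-in scorer: counts shared lowercase words (not a real cross-encoder)."""
    return [len(set(q.lower().split()) & set(c.lower().split())) for q, c in pairs]


# Hypothetical candidates, as they might come out of the fusion step.
candidates = [
    "library opening hours",
    "the cs401 exam is on sunday",
    "exam rules for all courses",
]
top = rerank("when is the cs401 exam", candidates, toy_overlap_scorer, top_k=2)
```

With sentence-transformers installed, the predict method of a CrossEncoder loaded from cross-encoder/ms-marco-MiniLM-L-6-v2 accepts a list of text pairs and could be passed as score_pairs in place of the toy scorer.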

Why no retry loop?

The original code retried up to 3 times when the model said it couldn't find information. Retrying with the same bad index doesn't fix bad retrieval; it just burns tokens and adds latency. Better retrieval at the source is the right fix.


Configuration Reference

All tunable constants are at the top of each file:

| Constant | File | Default | Description |
|---|---|---|---|
| CHUNK_SIZE | rag_preprocessor.py | 400 | Characters per document chunk |
| OVERLAP | rag_preprocessor.py | 100 | Overlap between adjacent chunks |
| EMBEDDING_MODEL | all files | text-embedding-3-large | Must be identical in all files |
| OPENAI_MODEL | app.py | gpt-4o | Model used for answer generation |
| TOP_K_VECTOR | app.py | 20 | Vector search candidates |
| TOP_K_BM25 | app.py | 20 | BM25 search candidates |
| TOP_K_FINAL | app.py | 5 | Chunks sent to GPT after re-ranking |
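
To show how CHUNK_SIZE and OVERLAP interact, here is a minimal sliding-window chunker. It assumes plain character windows; whether rag_preprocessor.py also respects sentence or paragraph boundaries is not stated here, and the function name chunk_text is hypothetical.

```python
CHUNK_SIZE = 400  # characters per chunk (default from the table above)
OVERLAP = 100     # characters shared by adjacent chunks


def chunk_text(text, chunk_size=CHUNK_SIZE, overlap=OVERLAP):
    """Split text into fixed-size character windows that overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


chunks = chunk_text("x" * 1000)
```

With the defaults, each new chunk starts 300 characters after the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.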

Requirements

flask
flask-cors
openai
faiss-cpu
numpy
rank_bm25
sentence-transformers
PyMuPDF
python-docx
langdetect
python-dotenv

Important Notes

  • Re-index after any document change. If you add, remove, or update files in input_documents/, re-run all three build steps.
  • The embedding model must be consistent. rag_preprocessor.py, embedding_generator.py, vector_store.py, and app.py must all use the same model. Mixing models produces nonsensical search results.
  • The .env file must never be committed to git. Add .env to your .gitignore.
  • First startup downloads the re-ranker model (cross-encoder/ms-marco-MiniLM-L-6-v2, ~90 MB) from HuggingFace. Subsequent startups load it from cache.

Built With