title: KASITBot
emoji: π
colorFrom: green
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
license: mit
KASITBot
Intelligent Academic Assistant for the King Abdullah II School of Information Technology (KASIT), University of Jordan
A bilingual (Arabic / English) RAG-powered chatbot that answers student questions about courses, exams, faculty, graduation requirements, and research labs, with every answer grounded in real faculty documents.
How It Works
KASITBot uses a Hybrid RAG pipeline to find answers:
- Vector Search: embeds the query with text-embedding-3-large and finds the 20 most semantically similar document chunks in FAISS
- BM25 Keyword Search: finds exact keyword matches (course codes, professor names, room numbers, Arabic proper nouns)
- RRF Fusion: merges both result lists into a single unified ranking using Reciprocal Rank Fusion
- Cross-Encoder Re-Ranking: scores each candidate jointly with the query and keeps the best 5
- GPT-4o Answer Generation: only the top 5 re-ranked chunks are sent to the model, keeping the context precise and clean
User Question
     │
     ├──► Vector Search (FAISS) ─────┐
     │                               ├──► RRF Fusion ──► Re-Ranker ──► GPT-4o ──► Answer
     └──► BM25 Keyword Search ───────┘
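The fusion step in the diagram can be sketched as follows. This is a minimal illustration of Reciprocal Rank Fusion, not the project's actual code: the function name `rrf_fuse`, the sample chunk IDs, and the smoothing constant `k=60` (a commonly used default for RRF) are assumptions.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of chunk IDs with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) to a chunk's score, where rank
    starts at 1. k=60 is an assumed default, not a value from this project.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical chunk IDs returned by the two retrievers, best match first.
vector_hits = [42, 7, 13, 99]
bm25_hits = [7, 55, 42, 8]

fused = rrf_fuse([vector_hits, bm25_hits])
print(fused)
```

Chunks that appear in both lists (here 7 and 42) rise to the top, because RRF rewards agreement between retrievers without needing their raw scores to be comparable.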
Project Structure
kasitbot/
├── app.py                            # Flask web server + RAG chat logic
├── rag_preprocessor.py               # Step 1: Extract & chunk PDF/DOCX documents
├── embedding_generator.py            # Step 2: Embed chunks with OpenAI
├── vector_store.py                   # Step 3: Build & save FAISS index
├── faiss_index/
│   ├── index.faiss                   # Binary vector index (generated)
│   └── metadata.json                 # Chunk text + source info (generated)
├── input_documents/                  # Place your PDF and DOCX files here
├── rag_dataset.json                  # Preprocessor output (generated)
├── rag_dataset_with_embeddings.json  # Embeddings output (generated)
├── .env                              # Your API key (never commit this)
├── .env.example                      # Template for .env
└── requirements.txt
Setup & Installation
Prerequisites
- Python 3.10+
- An OpenAI API key
1. Clone and install dependencies
git clone <your-repo-url>
cd kasitbot
pip install -r requirements.txt
2. Configure your API key
cp .env.example .env
Open .env and replace the placeholder:
OPENAI_API_KEY=sk-your-real-key-here
⚠️ Never commit your .env file. Add it to .gitignore.
3. Add your documents
Place your faculty PDF and DOCX files inside input_documents/:
input_documents/
├── kasit_handbook.pdf
├── course_catalog.docx
├── faculty_schedules.pdf
└── exam_timetable.docx
Both Arabic and English documents are supported natively; no translation is needed.
Building the Index
Run these three steps in order whenever you add or update documents:
# Step 1: Extract and chunk documents
python rag_preprocessor.py
# Step 2: Generate embeddings with text-embedding-3-large
python embedding_generator.py
# Step 3: Build and save the FAISS vector index
python vector_store.py
You only need to re-run these when your source documents change. The index is saved to disk and loaded automatically when the app starts.
Running the App
python app.py
Open your browser at: http://localhost:5000
API Reference
POST /api/chat
Send a conversation and receive an answer.
Request body:
{
  "messages": [
    { "role": "user", "content": "What are the graduation credit requirements?" }
  ],
  "lang": "en"
}
| Field | Type | Description |
|---|---|---|
| messages | array | Full conversation history in ChatGPT format |
| lang | string | "en" for English, "ar" for Arabic |
Response:
{
  "answer": "To graduate from KASIT, students must complete...",
  "sources": [
    { "rank": 1, "source": "kasit_handbook.pdf", "chunk_id": 42, "score": 0.91 }
  ],
  "retrieval": "hybrid"
}
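A client call to this endpoint might look like the sketch below. It uses only the standard library. The helper names `build_chat_request` and `ask` are hypothetical, and the base URL assumes the local address given under Running the App; adjust it for a deployed instance.

```python
import json
import urllib.request

# Assumption: the app is running locally at the address from "Running the App".
BASE_URL = "http://localhost:5000"


def build_chat_request(question, lang="en", history=None):
    """Build the JSON body for POST /api/chat from a question and prior turns."""
    messages = list(history or [])
    messages.append({"role": "user", "content": question})
    return {"messages": messages, "lang": lang}


def ask(question, lang="en", history=None, base_url=BASE_URL):
    """Send the question to /api/chat and return the decoded JSON response."""
    body = json.dumps(build_chat_request(question, lang, history)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `ask("What are the graduation credit requirements?")` should return a dictionary with the `answer`, `sources`, and `retrieval` keys shown in the response above. For a follow-up question, pass the earlier turns as `history` so the full conversation is sent each time.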
GET /api/health
Returns the status of all components.
{
  "status": "ok",
  "rag_available": true,
  "bm25_available": true,
  "reranker_available": true,
  "model": "gpt-4o",
  "embedding_model": "text-embedding-3-large",
  "retrieval_mode": "hybrid (vector + BM25 + cross-encoder)"
}
Key Design Decisions
Why no Arabic translation?
Earlier versions translated Arabic document chunks to English before embedding them. This caused two problems: translation errors corrupted the content, and Arabic queries no longer matched the translated embeddings. text-embedding-3-large handles Arabic natively and accurately, so translation is unnecessary.
Why hybrid search?
Pure vector search fails on exact lookups. If a student asks "when is the CS401 exam?" or "what does Dr. Omar teach?", vector similarity may miss the right chunk because embeddings capture overall meaning rather than exact tokens such as course codes or names. BM25 catches these keyword matches, so the two approaches complement each other.
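To illustrate why keyword scoring helps with exact lookups, here is a small pure-Python BM25 sketch. It is a dependency-free stand-in, not the rank_bm25 API the project uses. The function name `bm25_scores`, the parameters `k1` and `b`, the whitespace tokenizer, and the sample chunks are all assumptions made for the example.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for doc in docs_tokens:
        doc_freq.update(set(doc))
    scores = []
    for doc in docs_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Smoothed IDF variant that is always positive.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)
    return scores


# Hypothetical chunks; only the first one contains the exact course code.
chunks = [
    "The CS401 final exam is held in room 201",
    "Students should review assessment dates for advanced courses",
    "Dr. Omar teaches databases and operating systems",
]
tokenized = [chunk.lower().split() for chunk in chunks]
scores = bm25_scores("when is the cs401 exam".split(), tokenized)
best = max(range(len(chunks)), key=scores.__getitem__)
print(chunks[best])
```

The second chunk is topically related to exams but shares no tokens with the query, so BM25 gives it a score of zero, while the chunk containing the literal "cs401" wins. A production tokenizer would also need to handle punctuation and Arabic text, which plain whitespace splitting does not.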
Why a re-ranker?
Sending 20 chunks to GPT is noisy: the model can lose the answer in the middle of irrelevant text (the "lost in the middle" problem). The cross-encoder re-ranker reads each candidate together with the query and scores the pair jointly, producing a much more accurate ranking. Only the top 5 go to GPT.
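The re-ranking step can be sketched with a pluggable scoring function. This is an illustration only: `rerank` and `overlap_score` are hypothetical names, and the word-overlap scorer is a toy stand-in so the example runs without downloading a model. In a real pipeline, the score would presumably come from a cross-encoder such as the one named under Important Notes.

```python
def rerank(query, candidates, score_fn, top_k=5):
    """Score every (query, candidate) pair and keep the top_k candidates."""
    scored = [(score_fn(query, text), text) for text in candidates]
    # Sort by score only, highest first; Python's sort keeps ties in input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]


def overlap_score(query, text):
    """Toy stand-in for a cross-encoder: fraction of query words found in text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


# Hypothetical fused candidates from the retrieval stage.
candidates = [
    "Library opening hours are 8am to 6pm",
    "The CS401 exam is on Sunday",
    "CS401 covers distributed systems",
]
top = rerank("when is the cs401 exam", candidates, overlap_score, top_k=2)
print(top)
```

Because the scorer sees the query and the candidate together, it can judge relevance per pair, which is the property that distinguishes a cross-encoder from comparing two independently computed embeddings.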
Why no retry loop?
The original code retried up to 3 times when the model said it couldn't find information. Retrying with the same bad index doesn't fix bad retrieval; it just burns tokens and adds latency. Better retrieval at the source is the right fix.
Configuration Reference
All tunable constants are at the top of each file:
| Constant | File | Default | Description |
|---|---|---|---|
| CHUNK_SIZE | rag_preprocessor.py | 400 | Characters per document chunk |
| OVERLAP | rag_preprocessor.py | 100 | Overlap between adjacent chunks |
| EMBEDDING_MODEL | all files | text-embedding-3-large | Must be identical in all files |
| OPENAI_MODEL | app.py | gpt-4o | Model used for answer generation |
| TOP_K_VECTOR | app.py | 20 | Vector search candidates |
| TOP_K_BM25 | app.py | 20 | BM25 search candidates |
| TOP_K_FINAL | app.py | 5 | Chunks sent to GPT after re-ranking |
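To show how CHUNK_SIZE and OVERLAP interact, here is a minimal sliding-window chunker. It assumes both values are measured in characters and that chunks are cut at fixed offsets. The actual logic in rag_preprocessor.py may differ (for example, by respecting sentence boundaries), and `chunk_text` is a hypothetical name.

```python
CHUNK_SIZE = 400  # characters per chunk (default from the table above)
OVERLAP = 100     # assumed to be characters shared by adjacent chunks


def chunk_text(text, chunk_size=CHUNK_SIZE, overlap=OVERLAP):
    """Split text into fixed-size character windows that overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # each window starts this far after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# A 1000-character sample made of repeating digits.
sample = "".join(str(i % 10) for i in range(1000))
chunks = chunk_text(sample)
print([len(chunk) for chunk in chunks])
```

With the defaults, each window starts 300 characters after the previous one, so the last 100 characters of one chunk reappear at the start of the next. This keeps a sentence that straddles a boundary retrievable from at least one chunk.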
Requirements
flask
flask-cors
openai
faiss-cpu
numpy
rank_bm25
sentence-transformers
PyMuPDF
python-docx
langdetect
python-dotenv
Important Notes
- Re-index after any document change. If you add, remove, or update files in input_documents/, re-run all three build steps.
- The embedding model must be consistent. rag_preprocessor.py, embedding_generator.py, vector_store.py, and app.py must all use the same model. Mixing models produces nonsensical search results.
- The .env file must never be committed to git. Add .env to your .gitignore.
- First startup downloads the re-ranker model (cross-encoder/ms-marco-MiniLM-L-6-v2, ~90 MB) from Hugging Face. Subsequent startups load it from cache.
Built With
- Flask: Web framework
- OpenAI API: Embeddings (text-embedding-3-large) + Chat (gpt-4o)
- FAISS: Vector similarity search
- rank_bm25: BM25 keyword search
- Sentence Transformers: Cross-encoder re-ranking
- PyMuPDF: PDF text extraction
- python-docx: DOCX text extraction