
title: KASITBot
emoji: 🎓
colorFrom: green
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
license: mit

KASITBot 🎓

Intelligent Academic Assistant for the King Abdullah II School of Information Technology (KASIT), University of Jordan

A bilingual (Arabic / English) RAG-powered chatbot that answers student questions about courses, exams, faculty, graduation requirements, and research labs — grounded in real faculty documents.

How It Works

KASITBot uses a Hybrid RAG pipeline to find answers:

1. Vector Search: embeds the query with text-embedding-3-large and retrieves the 20 most semantically similar document chunks from the FAISS index.
2. BM25 Keyword Search: finds exact keyword matches (course codes, professor names, room numbers, Arabic proper nouns).
3. RRF Fusion: merges both result lists into a single ranking using Reciprocal Rank Fusion.
4. Cross-Encoder Re-Ranking: scores each candidate jointly with the query and keeps the best 5.
5. GPT-4o Answer Generation: only the top 5 re-ranked chunks are sent to the model, keeping the context short and relevant.

User Question
     │
     ├──► Vector Search (FAISS)  ─────┐
     │                                ├──► RRF Fusion ──► Re-Ranker ──► GPT-4o ──► Answer
     └──► BM25 Keyword Search ────────┘
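
The fusion step can be sketched in a few lines. This is a minimal illustration of Reciprocal Rank Fusion, not the code in app.py: the function name `rrf_fuse`, the sample chunk ids, and the constant `k=60` (a common default for RRF) are assumptions made for the example.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several best-first lists of chunk ids into one ranking.

    Each chunk earns 1 / (k + rank) from every list it appears in,
    so chunks found by both retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk id for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical chunk ids returned by the two retrievers.
vector_hits = [12, 7, 42, 3]
bm25_hits = [42, 12, 99]
fused = rrf_fuse([vector_hits, bm25_hits])
```

Chunks 12 and 42 appear in both lists, so they outrank chunks that only one retriever found. RRF uses ranks rather than raw scores, which avoids having to normalize FAISS distances against BM25 scores.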

Project Structure

kasitbot/
├── app.py                          # Flask web server + RAG chat logic
├── rag_preprocessor.py             # Step 1: Extract & chunk PDF/DOCX documents
├── embedding_generator.py          # Step 2: Embed chunks with OpenAI
├── vector_store.py                 # Step 3: Build & save FAISS index
├── faiss_index/
│   ├── index.faiss                 # Binary vector index (generated)
│   └── metadata.json              # Chunk text + source info (generated)
├── input_documents/                # Place your PDF and DOCX files here
├── rag_dataset.json               # Preprocessor output (generated)
├── rag_dataset_with_embeddings.json  # Embeddings output (generated)
├── .env                           # Your API key (never commit this)
├── .env.example                   # Template for .env
└── requirements.txt

Setup & Installation

Prerequisites

Python 3.10+
An OpenAI API key

1. Clone and install dependencies

git clone <your-repo-url>
cd kasitbot
pip install -r requirements.txt

2. Configure your API key

cp .env.example .env

Open .env and replace the placeholder:

OPENAI_API_KEY=sk-your-real-key-here

⚠️ Never commit your .env file. Add it to .gitignore.
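
A fail-fast check at startup can catch a missing or unedited key before the first request. The sketch below uses only the standard library and assumes the key has already been loaded into the environment (for example by python-dotenv's `load_dotenv()`); the helper name `read_api_key` and the treatment of `sk-your-real-key-here` as the placeholder are assumptions for illustration.

```python
import os

# Assumed placeholder value; adjust to whatever .env.example actually contains.
PLACEHOLDER = "sk-your-real-key-here"


def read_api_key(env=None):
    """Return the OpenAI key from the environment or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key or key == PLACEHOLDER:
        raise RuntimeError(
            "OPENAI_API_KEY is missing or still set to the placeholder"
        )
    return key
```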

3. Add your documents

Place your faculty PDF and DOCX files inside input_documents/:

input_documents/
├── kasit_handbook.pdf
├── course_catalog.docx
├── faculty_schedules.pdf
└── exam_timetable.docx

Both Arabic and English documents are supported natively — no translation needed.

Building the Index

Run these three steps in order whenever you add or update documents:

# Step 1 — Extract and chunk documents
python rag_preprocessor.py

# Step 2 — Generate embeddings with text-embedding-3-large
python embedding_generator.py

# Step 3 — Build and save the FAISS vector index
python vector_store.py

You only need to re-run these when your source documents change. The index is saved to disk and loaded automatically when the app starts.
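
To make Step 1 concrete, here is a minimal sketch of fixed-size character chunking with overlap, using the default CHUNK_SIZE and OVERLAP values from the configuration reference. The function `chunk_text` is hypothetical; the real rag_preprocessor.py may split on sentence or paragraph boundaries instead.

```python
CHUNK_SIZE = 400  # characters per chunk (README default)
OVERLAP = 100     # characters shared by adjacent chunks (README default)


def chunk_text(text, chunk_size=CHUNK_SIZE, overlap=OVERLAP):
    """Split text into windows of chunk_size characters that overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# A 1000-character sample made of repeating digits.
sample = "".join(str(i % 10) for i in range(1000))
chunks = chunk_text(sample)
```

With these defaults each new chunk starts 300 characters after the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.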

Running the App

python app.py

Open your browser at: http://localhost:5000

API Reference

`POST /api/chat`

Send a conversation and receive an answer.

Request body:

{
  "messages": [
    { "role": "user", "content": "What are the graduation credit requirements?" }
  ],
  "lang": "en"
}

| Field | Type | Description |
|---|---|---|
| `messages` | array | Full conversation history as a list of `role`/`content` objects (OpenAI chat format) |
| `lang` | string | `"en"` for English, `"ar"` for Arabic |

Response:

{
  "answer": "To graduate from KASIT, students must complete...",
  "sources": [
    { "rank": 1, "source": "kasit_handbook.pdf", "chunk_id": 42, "score": 0.91 }
  ],
  "retrieval": "hybrid"
}
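
A client call against this endpoint might look like the following standard-library sketch. The helper names `build_chat_request` and `ask` are hypothetical, and the base URL assumes the app is running locally as described under Running the App.

```python
import json
import urllib.request


def build_chat_request(question, history=None, lang="en",
                       base_url="http://localhost:5000"):
    """Build a POST /api/chat request with the documented body shape."""
    if lang not in ("en", "ar"):
        raise ValueError("lang must be 'en' or 'ar'")
    messages = list(history or []) + [{"role": "user", "content": question}]
    # ensure_ascii=False keeps Arabic text readable in the payload.
    body = json.dumps({"messages": messages, "lang": lang},
                      ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        base_url + "/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question, **kwargs):
    """Send the question and return the decoded JSON response."""
    request = build_chat_request(question, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `ask("What are the graduation credit requirements?")["answer"]` would return the answer text, and the `sources` list shows which chunks it was grounded in.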

`GET /api/health`

Returns the status of all components.

{
  "status": "ok",
  "rag_available": true,
  "bm25_available": true,
  "reranker_available": true,
  "model": "gpt-4o",
  "embedding_model": "text-embedding-3-large",
  "retrieval_mode": "hybrid (vector + BM25 + cross-encoder)"
}
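
A monitoring script could use this payload to detect a degraded pipeline. The sketch below only inspects the three availability flags shown above; the function name `unavailable_components` is an assumption, and a missing flag is treated as unavailable.

```python
def unavailable_components(health):
    """Return the names of availability flags that are false or absent."""
    flags = ("rag_available", "bm25_available", "reranker_available")
    return [name for name in flags if not health.get(name, False)]


# Example payload copied from the response above.
healthy = {
    "status": "ok",
    "rag_available": True,
    "bm25_available": True,
    "reranker_available": True,
}
problems = unavailable_components(healthy)
```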

Key Design Decisions

Why no Arabic translation?

Earlier versions translated Arabic document chunks to English before embedding them. This caused two problems: translation errors corrupted the content, and Arabic queries no longer matched the translated embeddings. text-embedding-3-large is multilingual and embeds Arabic text directly, so translation is unnecessary.

Why hybrid search?

Pure vector search is weak on exact lookups. If a student asks "when is the CS401 exam?" or "what does Dr. Omar teach?", vector similarity may miss the relevant chunk, because identifiers such as course codes and names carry little semantic signal in an embedding. BM25 catches these exact keyword matches, so the two approaches complement each other.
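
The keyword side can be illustrated with a small from-scratch Okapi BM25 scorer. The project lists rank_bm25 in its requirements; this dependency-free stand-in is not that library's code, and the tokenizer, the parameters `k1=1.5` and `b=0.75`, and the sample documents are all assumptions for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # \w is Unicode-aware in Python 3, so Arabic words are kept as tokens.
    return re.findall(r"\w+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avg_len = (sum(len(t) for t in tokenized) / n_docs) or 1.0
    # Number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = tokenize(query)
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "The CS401 final exam is held in room 201.",
    "Assessment dates for senior computing courses are announced by the department.",
    "Dr. Omar teaches the databases course.",
]
scores = bm25_scores("when is the CS401 exam?", docs)
best = scores.index(max(scores))
```

The first document wins because it contains the literal token `cs401`, which appears in only one document and therefore gets a high inverse document frequency weight.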

Why a re-ranker?

Sending all 20 candidates from each retriever to GPT-4o is noisy: the model can lose the answer in the middle of irrelevant text (the "lost in the middle" problem). The cross-encoder re-ranker reads each candidate together with the query and scores the pair jointly, which typically ranks more accurately than comparing independently computed embeddings. Only the top 5 go to GPT-4o.
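
The re-ranking step reduces to "score every (query, chunk) pair, sort, keep the top 5". The sketch below takes the scorer as a parameter so it runs without downloading a model; `rerank`, `overlap_scorer`, and the candidate dictionaries are hypothetical. In the real pipeline the scorer would presumably be the `predict` method of a sentence-transformers `CrossEncoder` loaded with cross-encoder/ms-marco-MiniLM-L-6-v2.

```python
def rerank(query, candidates, score_pairs, top_k=5):
    """Keep the top_k candidates by joint (query, text) score.

    score_pairs takes a list of (query, text) pairs and returns one
    score per pair, e.g. CrossEncoder(...).predict in the real app.
    """
    pairs = [(query, c["text"]) for c in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(candidates, scores),
                    key=lambda item: item[1], reverse=True)
    return [dict(c, score=float(s)) for c, s in ranked[:top_k]]


def overlap_scorer(pairs):
    # Toy stand-in for a cross-encoder: share of query words in the text.
    out = []
    for query, text in pairs:
        q_words = set(query.lower().split())
        t_words = set(text.lower().split())
        out.append(len(q_words & t_words) / max(len(q_words), 1))
    return out


candidates = [
    {"chunk_id": 1, "text": "library opening hours"},
    {"chunk_id": 2, "text": "graduation credit requirements for students"},
    {"chunk_id": 3, "text": "credit transfer rules"},
]
top = rerank("graduation credit requirements", candidates,
             overlap_scorer, top_k=2)
```

Passing the scorer in also makes the step easy to unit-test and lets the app fall back to the fused order when the re-ranker model is unavailable.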

Why no retry loop?

The original code retried up to 3 times when the model said it couldn't find information. Retrying with the same bad index doesn't fix bad retrieval — it just burns tokens and adds latency. Better retrieval at the source is the right fix.

Configuration Reference

All tunable constants are at the top of each file:

| Constant | File | Default | Description |
|---|---|---|---|
| `CHUNK_SIZE` | `rag_preprocessor.py` | `400` | Characters per document chunk |
| `OVERLAP` | `rag_preprocessor.py` | `100` | Characters of overlap between adjacent chunks |
| `EMBEDDING_MODEL` | all files | `text-embedding-3-large` | Must be identical in all files |
| `OPENAI_MODEL` | `app.py` | `gpt-4o` | Model used for answer generation |
| `TOP_K_VECTOR` | `app.py` | `20` | Vector search candidates |
| `TOP_K_BM25` | `app.py` | `20` | BM25 search candidates |
| `TOP_K_FINAL` | `app.py` | `5` | Chunks sent to GPT-4o after re-ranking |

Requirements

flask
flask-cors
openai
faiss-cpu
numpy
rank_bm25
sentence-transformers
PyMuPDF
python-docx
langdetect
python-dotenv

Important Notes

Re-index after any document change. If you add, remove, or update files in input_documents/, re-run all three build steps.
The embedding model must be consistent. rag_preprocessor.py, embedding_generator.py, vector_store.py, and app.py must all use the same model. Mixing models produces nonsensical search results.
The .env file must never be committed to git. Add .env to your .gitignore.
First startup downloads the re-ranker model (cross-encoder/ms-marco-MiniLM-L-6-v2, ~90 MB) from HuggingFace. Subsequent startups load it from cache.

Built With

Flask — Web framework
OpenAI API — Embeddings (text-embedding-3-large) + Chat (gpt-4o)
FAISS — Vector similarity search
rank_bm25 — BM25 keyword search
Sentence Transformers — Cross-encoder re-ranking
PyMuPDF — PDF text extraction
python-docx — DOCX text extraction