---
title: KASITBot
emoji: 🎓
colorFrom: green
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
license: mit
---

# KASITBot 🎓

**Intelligent Academic Assistant for the King Abdullah II School of Information Technology (KASIT), University of Jordan**

A bilingual (Arabic / English) RAG-powered chatbot that answers student questions about courses, exams, faculty, graduation requirements, and research labs, grounded in real faculty documents.

---

## How It Works

KASITBot uses a **Hybrid RAG pipeline** to find answers:

1. **Vector Search**: embeds the query with `text-embedding-3-large` and finds the 20 most semantically similar document chunks in FAISS
2. **BM25 Keyword Search**: finds exact keyword matches (course codes, professor names, room numbers, Arabic proper nouns)
3. **RRF Fusion**: merges both result lists into a single unified ranking using Reciprocal Rank Fusion
4. **Cross-Encoder Re-Ranking**: scores each candidate jointly with the query and keeps the best 5
5. **GPT-4o Answer Generation**: only the top 5 re-ranked chunks are sent to the model, keeping context precise and clean

```
User Question
     │
     ├──► Vector Search (FAISS) ─────┐
     │                                ├──► RRF Fusion ──► Re-Ranker ──► GPT-4o ──► Answer
     └──► BM25 Keyword Search ────────┘
```

---

## Project Structure

```
kasitbot/
├── app.py                            # Flask web server + RAG chat logic
├── rag_preprocessor.py               # Step 1: Extract & chunk PDF/DOCX documents
├── embedding_generator.py            # Step 2: Embed chunks with OpenAI
├── vector_store.py                   # Step 3: Build & save FAISS index
├── faiss_index/
│   ├── index.faiss                   # Binary vector index (generated)
│   └── metadata.json                 # Chunk text + source info (generated)
├── input_documents/                  # Place your PDF and DOCX files here
├── rag_dataset.json                  # Preprocessor output (generated)
├── rag_dataset_with_embeddings.json  # Embeddings output (generated)
├── .env                              # Your API key (never commit this)
├── .env.example                      # Template for .env
└── requirements.txt
```

---

## Setup & Installation

### Prerequisites

- Python 3.10+
- An OpenAI API key

### 1. Clone and install dependencies

```bash
git clone <repository-url>
cd kasitbot
pip install -r requirements.txt
```

### 2. Configure your API key

```bash
cp .env.example .env
```

Open `.env` and replace the placeholder:

```
OPENAI_API_KEY=sk-your-real-key-here
```

> ⚠️ Never commit your `.env` file. Add it to `.gitignore`.

### 3. Add your documents

Place your faculty PDF and DOCX files inside `input_documents/`:

```
input_documents/
├── kasit_handbook.pdf
├── course_catalog.docx
├── faculty_schedules.pdf
└── exam_timetable.docx
```

Both Arabic and English documents are supported natively; no translation is needed.

---

## Building the Index

Run these three steps **in order** whenever you add or update documents:

```bash
# Step 1: Extract and chunk documents
python rag_preprocessor.py

# Step 2: Generate embeddings with text-embedding-3-large
python embedding_generator.py

# Step 3: Build and save the FAISS vector index
python vector_store.py
```

You only need to re-run these when your source documents change. The index is saved to disk and loaded automatically when the app starts.

---

## Running the App

```bash
python app.py
```

Open your browser at: **http://localhost:5000**

---

## API Reference

### `POST /api/chat`

Send a conversation and receive an answer.

**Request body:**

```json
{
  "messages": [
    { "role": "user", "content": "What are the graduation credit requirements?" }
  ],
  "lang": "en"
}
```

| Field | Type | Description |
|---|---|---|
| `messages` | array | Full conversation history as a list of `role` / `content` objects (ChatGPT format) |
| `lang` | string | `"en"` for English, `"ar"` for Arabic |

**Response:**

```json
{
  "answer": "To graduate from KASIT, students must complete...",
  "sources": [
    { "rank": 1, "source": "kasit_handbook.pdf", "chunk_id": 42, "score": 0.91 }
  ],
  "retrieval": "hybrid"
}
```

### `GET /api/health`

Returns the status of all components.

```json
{
  "status": "ok",
  "rag_available": true,
  "bm25_available": true,
  "reranker_available": true,
  "model": "gpt-4o",
  "embedding_model": "text-embedding-3-large",
  "retrieval_mode": "hybrid (vector + BM25 + cross-encoder)"
}
```

---

## Key Design Decisions

### Why no Arabic translation?

Earlier versions translated Arabic document chunks to English before embedding them. This caused two problems: translation errors corrupted the content, and Arabic queries no longer matched the translated embeddings. `text-embedding-3-large` handles Arabic natively and accurately, so translation is unnecessary.

### Why hybrid search?

Pure vector search fails on exact lookups. If a student asks "when is the CS401 exam?" or "what does Dr. Omar teach?", vector similarity may miss the relevant chunk, because embeddings capture overall meaning rather than exact terms such as course codes or names. BM25 catches these keyword matches. The two approaches complement each other.

### Why a re-ranker?

Sending 20 chunks to GPT is noisy: the model can lose the answer in the middle of irrelevant text (the "lost in the middle" problem). The cross-encoder re-ranker reads each candidate together with the query and scores the pair jointly, producing a much more accurate ranking. Only the top 5 go to GPT.

### Why no retry loop?

The original code retried up to 3 times when the model said it couldn't find information. Retrying against the same index doesn't fix bad retrieval; it just burns tokens and adds latency. Better retrieval at the source is the right fix.

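
The RRF fusion step that underpins the hybrid search described above can be sketched in a few lines. This is a minimal illustration rather than the code in `app.py`: the function name `rrf_fuse`, the constant `k = 60` (a value commonly used for RRF), and the sample chunk ids are all assumptions made for the example. Each chunk earns `1 / (k + rank)` from every list it appears in, so chunks ranked well by both retrievers rise to the top.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of chunk ids with Reciprocal Rank Fusion.

    `rankings` is a list of lists, each ordered best-first.
    `k` is a smoothing constant (hypothetical default, not taken from app.py).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # A chunk gains more score the higher it sits in each list
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk id for determinism
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical result lists: chunk ids ordered best-first
vector_hits = [42, 7, 13]
bm25_hits = [7, 99, 42]

fused = rrf_fuse([vector_hits, bm25_hits])
print(fused)
```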

---

## Configuration Reference

All tunable constants are at the top of each file:

| Constant | File | Default | Description |
|---|---|---|---|
| `CHUNK_SIZE` | `rag_preprocessor.py` | `400` | Characters per document chunk |
| `OVERLAP` | `rag_preprocessor.py` | `100` | Overlap between adjacent chunks |
| `EMBEDDING_MODEL` | all files | `text-embedding-3-large` | Must be identical in all files |
| `OPENAI_MODEL` | `app.py` | `gpt-4o` | Model used for answer generation |
| `TOP_K_VECTOR` | `app.py` | `20` | Vector search candidates |
| `TOP_K_BM25` | `app.py` | `20` | BM25 search candidates |
| `TOP_K_FINAL` | `app.py` | `5` | Chunks sent to GPT after re-ranking |

---

## Requirements

```
flask
flask-cors
openai
faiss-cpu
numpy
rank_bm25
sentence-transformers
PyMuPDF
python-docx
langdetect
python-dotenv
```

---

## Important Notes

- **Re-index after any document change.** If you add, remove, or update files in `input_documents/`, re-run all three build steps.
- **The embedding model must be consistent.** `rag_preprocessor.py`, `embedding_generator.py`, `vector_store.py`, and `app.py` must all use the same model. Mixing models produces nonsensical search results.
- **The `.env` file must never be committed to git.** Add `.env` to your `.gitignore`.
- **First startup downloads the re-ranker model** (`cross-encoder/ms-marco-MiniLM-L-6-v2`, ~90 MB) from HuggingFace. Subsequent startups load it from cache.

---

## Built With

- [Flask](https://flask.palletsprojects.com/): Web framework
- [OpenAI API](https://platform.openai.com/): Embeddings (`text-embedding-3-large`) + Chat (`gpt-4o`)
- [FAISS](https://github.com/facebookresearch/faiss): Vector similarity search
- [rank_bm25](https://github.com/dorianbrown/rank_bm25): BM25 keyword search
- [Sentence Transformers](https://www.sbert.net/): Cross-encoder re-ranking
- [PyMuPDF](https://pymupdf.readthedocs.io/): PDF text extraction
- [python-docx](https://python-docx.readthedocs.io/): DOCX text extraction
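
---

## Appendix: Chunking Sketch

To illustrate how `CHUNK_SIZE` and `OVERLAP` from the Configuration Reference interact, here is a rough sketch of fixed-size character chunking with overlap. It is not the code in `rag_preprocessor.py`; the function name `chunk_text` and the stopping rule are assumptions made for the example. With the defaults, each new chunk starts 300 characters after the previous one, so neighbouring chunks share 100 characters.

```python
def chunk_text(text, chunk_size=400, overlap=100):
    """Split `text` into character windows of `chunk_size` that overlap by `overlap`.

    Hypothetical helper for illustration; defaults mirror CHUNK_SIZE and OVERLAP.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # distance between consecutive chunk starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# A 1000-character sample document
sample = "".join(chr(97 + i % 26) for i in range(1000))
chunks = chunk_text(sample)
print(len(chunks), [len(c) for c in chunks])
```

A larger overlap lowers the chance that a sentence is cut in half at a chunk boundary, at the cost of more chunks to embed and store.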