Commit 4787e22 · Subhakanta committed · 0 parent(s)

Initial commit without data folder

Files changed (15)
  1. .dockerignore +11 -0
  2. .gitignore +30 -0
  3. Dockerfile +44 -0
  4. LICENSE +21 -0
  5. README.md +109 -0
  6. app.py +13 -0
  7. backend/api.py +33 -0
  8. backend/models.py +7 -0
  9. frontend/index.html +22 -0
  10. frontend/script.js +37 -0
  11. frontend/style.css +86 -0
  12. src/chatbot.py +148 -0
  13. src/check_index.py +25 -0
  14. src/ingest.py +86 -0
  15. src/query.py +47 -0
.dockerignore ADDED
@@ -0,0 +1,11 @@
+ __pycache__/
+ *.pyc
+ *.pyo
+ *.pyd
+ .env
+ .venv
+ venv/
+ .env/
+ .git/
+ .gitignore
+ /data
.gitignore ADDED
@@ -0,0 +1,30 @@
+ # Python Virtual Environments
+ rag_env/
+ venv/
+ env/
+ .venv/
+
+ # Secret Keys and Environment Variables
+ .env
+
+ # Generated Data & Vector Stores
+ vectorStore/
+ /backend/vectorStore/
+ data/
+ *.pdf
+ *.txt
+ # Python specific
+ __pycache__/
+ *.pyc
+ *.pyo
+ *.pyd
+
+ # IDE / Editor specific settings
+ # Ignore personal editor configurations
+ .vscode/
+ .idea/
+
+ # OS-specific files
+ # Ignore files generated by macOS and Windows
+ .DS_Store
+ Thumbs.db
Dockerfile ADDED
@@ -0,0 +1,44 @@
+ # ================================
+ # 1. Use an official Python runtime
+ # ================================
+ FROM python:3.11-slim
+
+ # Prevent Python from writing pyc files & buffering stdout/stderr
+ ENV PYTHONDONTWRITEBYTECODE=1
+ ENV PYTHONUNBUFFERED=1
+
+ # ================================
+ # 2. Set working directory
+ # ================================
+ WORKDIR /app
+
+ # ================================
+ # 3. Install OS dependencies (if needed)
+ # ================================
+ # Add any OS packages here if your code needs them (curl, gcc, etc.)
+ RUN apt-get update && apt-get install -y --no-install-recommends \
+     build-essential \
+     && rm -rf /var/lib/apt/lists/*
+
+ # ================================
+ # 4. Install Python dependencies
+ # ================================
+ # Copy only requirements first for better caching
+ COPY requirements.txt .
+
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ # ================================
+ # 5. Copy project files
+ # ================================
+ COPY . .
+
+ # ================================
+ # 6. Expose the FastAPI port
+ # ================================
+ EXPOSE 7860
+
+ # ================================
+ # 7. Run the app with Uvicorn
+ # ================================
+ CMD ["uvicorn", "backend.api:app", "--host", "0.0.0.0", "--port", "7860"]
LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2025 Subhakanta Rath
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
README.md ADDED
@@ -0,0 +1,114 @@
+ # 🌀 Odisha Disaster Management RAG Chatbot
+
+ ## 📌 Overview
+ Odisha faces recurring disasters every year, such as **floods, cyclones, and droughts**.
+ While the state has a strong disaster management authority (OSDMA), information is often scattered across reports, research papers, and government documents.
+
+ This project builds a **Retrieval-Augmented Generation (RAG) chatbot** that provides citizens, researchers, and policymakers with **clear, reliable, and contextual answers** about Odisha's disaster management practices.
+
+ ---
+
+ ## ✨ Features
+ - Handles **132 PDFs** and **12 text files** (OSDMA, IMD, NDMA, research papers).
+ - **Preprocessing pipeline**: PDF/text extraction, cleaning, normalization, chunking.
+ - **Embeddings** with `sentence-transformers/all-MiniLM-L6-v2`.
+ - **FAISS vector database** for fast and efficient retrieval.
+ - **RAG pipeline**:
+   1. User query → query structuring (handles poor English, spelling issues).
+   2. Retrieve relevant chunks from FAISS.
+   3. If no relevant results → no LLM call (saves cost).
+   4. If relevant → LLM generates structured, contextual answers.
+ - **Prompt engineering** for better accuracy and reduced hallucinations.
+ - Backend: **FastAPI**.
+ - Frontend: **HTML, CSS, JS chatbot interface**.
+
+ ---
+
+ ## 🏗️ Architecture
+
+ **User Query → Query Structuring → FAISS Retriever → Relevant Chunks → LLM → Answer**
+
+ ---
+
+ ## 🛠️ Tech Stack
+
+ - **Python** (data handling & backend)
+ - **PyPDF, TextLoader** → PDF/text extraction
+ - **FAISS** → vector database
+ - **HuggingFace Sentence Transformers** → embeddings
+ - **FastAPI** → backend API
+ - **HTML, CSS, JavaScript** → frontend chatbot UI
+ - **LLM (OpenAI / HuggingFace)** → answer generation
+
+ ---
+
+ ## ⚙️ Installation
+
+ ### 1. Clone the repository
+ ```bash
+ git clone https://github.com/subhakanta156/odisha-disaster-knowledge-assistant.git
+ ```
+
+ ### 2. Create a virtual environment & install dependencies
+ ```bash
+ python -m venv venv
+ source venv/bin/activate   # Linux/Mac
+ venv\Scripts\activate      # Windows
+
+ pip install -r requirements.txt
+ ```
+
+ ### 3. Prepare the data
+ - Place all PDF/text files inside the `data/` folder.
+ - Run the preprocessing & embedding script:
+ ```bash
+ python src/ingest.py
+ ```
+
+ ### 4. Run the FastAPI backend
+ ```bash
+ uvicorn backend.api:app --reload
+ ```
+
+ ### 5. Open the frontend
+ - Open `frontend/index.html` in your browser.
+
+ ---
+
+ ## 🚀 Usage
+
+ Ask questions like:
+
+ - "How does Odisha's disaster proneness compare with other Indian states?"
+ - "Provide details of relief funds sanctioned for Odisha during the 1999 Super Cyclone."
+ - "Which Odisha agency is primarily responsible for issuing cyclone alerts?"
+ - "What key steps does the Odisha government take if lives are lost in a disaster?"
+
+ The system retrieves relevant chunks from the reports and generates reliable, structured answers.
+
+ ---
+
+ ## 📊 Optimizations
+
+ - Added query filtering → no LLM call if retrieval fails (reduces cost).
+ - Handled poor-English queries via query restructuring.
+ - Improved prompt engineering to minimize hallucinations.
+
+ ---
+
+ ## 📌 Future Improvements
+
+ - Add multilingual support (Odia/Hindi queries).
+ - Deploy on the cloud (AWS/GCP/Azure) with Docker.
+ - Use advanced embeddings (e.g., `all-mpnet-base-v2`) for higher accuracy.
+ - Add real-time updates (e.g., cyclone alerts).
+
+ ---
+
+ ## 👨‍💻 Author
+
+ **Subhakanta Rath**
+
+ MSc AI & ML @ IIIT Lucknow
+
+ Passionate about AI/ML and Data Engineering
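The cost-saving gate in step 3 of the RAG pipeline above can be sketched in a few lines of plain Python. This is illustrative only: `should_call_llm` and the sample distances are hypothetical, while the real project uses FAISS L2 distances over MiniLM embeddings.

```python
# Illustrative sketch of the "no LLM call if retrieval fails" gate.
# Scores are FAISS-style L2 distances: lower means more similar.
MAX_DISTANCE = 1.0  # hypothetical threshold, mirrors the project's tuning knob

def should_call_llm(retrieved):
    """retrieved: list of (chunk_text, distance) pairs, best match first."""
    if not retrieved:
        return False  # nothing retrieved: skip the LLM entirely
    _, best_distance = retrieved[0]
    return best_distance <= MAX_DISTANCE

print(should_call_llm([("Cyclone Fani relief report", 0.42)]))  # True
print(should_call_llm([("unrelated chunk", 1.73)]))             # False
print(should_call_llm([]))                                      # False
```

Only when this returns `True` does the pipeline pay for an LLM call; otherwise a canned "out of domain" reply is returned for free.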
app.py ADDED
@@ -0,0 +1,13 @@
+ # app.py at E:\odisha_disaster_chatbot (project entry point)
+
+ import sys
+ from pathlib import Path
+ import uvicorn
+
+ # Ensure the project root is on sys.path
+ BASE_DIR = Path(__file__).resolve().parent
+ sys.path.append(str(BASE_DIR))
+
+ if __name__ == "__main__":
+     # Run the app from backend/api.py
+     uvicorn.run("backend.api:app", host="0.0.0.0", port=8000, reload=True)
backend/api.py ADDED
@@ -0,0 +1,33 @@
+ import sys
+ import os
+
+ # Add the project root (E:\odisha_disaster_chatbot) to the Python path
+ sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+
+ from fastapi import FastAPI
+ from fastapi.middleware.cors import CORSMiddleware
+ from backend.models import ChatRequest, ChatResponse  # in backend/
+ from src.chatbot import RAGChatBot
+
+ app = FastAPI(title="Odisha Disaster Management Chatbot")
+
+ # ✅ Allow the frontend to talk to the backend
+ app.add_middleware(
+     CORSMiddleware,
+     allow_origins=["*"],
+     allow_credentials=True,
+     allow_methods=["*"],
+     allow_headers=["*"],
+ )
+
+ # Initialize the chatbot once (not per request)
+ bot = RAGChatBot()
+
+ @app.post("/chat", response_model=ChatResponse)
+ def chat(request: ChatRequest):
+     answer = bot.chat(request.query)
+     return ChatResponse(answer=answer)
+
+ @app.get("/")
+ def root():
+     return {"message": "✅ Odisha Disaster Management Chatbot API is running"}
backend/models.py ADDED
@@ -0,0 +1,7 @@
+ from pydantic import BaseModel
+
+ class ChatRequest(BaseModel):
+     query: str
+
+ class ChatResponse(BaseModel):
+     answer: str
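The wire format these two models define is a single JSON field in each direction. A stdlib-only sketch of that contract (`handle_chat` is a hypothetical stand-in; FastAPI and pydantic perform this parsing and validation for real):

```python
import json

def handle_chat(raw_body: str) -> str:
    """Parse a /chat request body and build a response body, per the models above."""
    payload = json.loads(raw_body)
    if "query" not in payload or not isinstance(payload["query"], str):
        raise ValueError("body must be of the form {'query': '<text>'}")
    # A real handler would call the RAG chatbot here.
    answer = f"received: {payload['query']}"
    return json.dumps({"answer": answer})

print(handle_chat('{"query": "Which agency issues cyclone alerts?"}'))
```

Any body missing the `query` string field is rejected, which is exactly the behavior pydantic gives the endpoint automatically.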
frontend/index.html ADDED
@@ -0,0 +1,22 @@
+ <!DOCTYPE html>
+ <html lang="en">
+ <head>
+     <meta charset="UTF-8">
+     <meta name="viewport" content="width=device-width, initial-scale=1.0">
+     <title>Odisha Disaster Chatbot</title>
+     <link rel="stylesheet" href="style.css">
+ </head>
+ <body>
+     <div class="chat-container">
+         <h2>Odisha Disaster Chatbot</h2>
+         <div id="chatbox" class="chatbox"></div>
+         <div class="input-area">
+             <input type="text" id="query" placeholder="Ask something..."
+                    onkeydown="if(event.key==='Enter') askBot()" />
+             <button onclick="askBot()">Send</button>
+         </div>
+     </div>
+     <script src="https://cdn.jsdelivr.net/npm/marked/marked.min.js"></script>
+     <script src="script.js"></script>
+ </body>
+ </html>
frontend/script.js ADDED
@@ -0,0 +1,37 @@
+ async function askBot() {
+     let query = document.getElementById("query").value;
+     if (!query.trim()) return;
+
+     // Show the user message
+     addMessage(query, "user");
+
+     // Clear the input field
+     document.getElementById("query").value = "";
+
+     try {
+         let res = await fetch("http://127.0.0.1:8000/chat", {
+             method: "POST",
+             headers: { "Content-Type": "application/json" },
+             body: JSON.stringify({ query: query })
+         });
+
+         let data = await res.json();
+
+         // Show the bot response
+         addMessage(data.answer, "bot");
+     } catch (error) {
+         addMessage("⚠️ Unable to reach server. Please try again later.", "bot");
+     }
+ }
+
+ function addMessage(text, sender) {
+     let chatbox = document.getElementById("chatbox");
+     let msg = document.createElement("div");
+     msg.classList.add("message", sender);
+     // Render the bot's Markdown (use msg.textContent = text for untrusted input)
+     msg.innerHTML = marked.parse(text);
+     chatbox.appendChild(msg);
+
+     // Auto-scroll to the bottom
+     chatbox.scrollTop = chatbox.scrollHeight;
+ }
frontend/style.css ADDED
@@ -0,0 +1,86 @@
+ body {
+     font-family: Arial, sans-serif;
+     background: #f0f2f5;
+     display: flex;
+     justify-content: center;
+     align-items: center;
+     height: 100vh;
+     margin: 0;
+ }
+
+ .chat-container {
+     width: 400px;
+     background: #fff;
+     border-radius: 12px;
+     box-shadow: 0 4px 12px rgba(0,0,0,0.1);
+     display: flex;
+     flex-direction: column;
+     overflow: hidden;
+ }
+
+ .chat-container h2 {
+     background: #007bff;
+     color: white;
+     padding: 15px;
+     margin: 0;
+     text-align: center;
+ }
+
+ .chatbox {
+     flex: 1;
+     padding: 15px;
+     overflow-y: auto;
+     display: flex;
+     flex-direction: column;
+     gap: 10px;
+ }
+
+ .message {
+     max-width: 75%;
+     padding: 10px 15px;
+     border-radius: 15px;
+     word-wrap: break-word;
+ }
+
+ .user {
+     align-self: flex-end;
+     background: #007bff;
+     color: white;
+     border-bottom-right-radius: 5px;
+ }
+
+ .bot {
+     align-self: flex-start;
+     background: #e4e6eb;
+     color: black;
+     border-bottom-left-radius: 5px;
+ }
+
+ .input-area {
+     display: flex;
+     padding: 10px;
+     border-top: 1px solid #ddd;
+     background: #fafafa;
+ }
+
+ .input-area input {
+     flex: 1;
+     padding: 10px;
+     border: 1px solid #ddd;
+     border-radius: 8px;
+     outline: none;
+     margin-right: 10px;
+ }
+
+ .input-area button {
+     padding: 10px 15px;
+     border: none;
+     border-radius: 8px;
+     background: #007bff;
+     color: white;
+     cursor: pointer;
+ }
+
+ .input-area button:hover {
+     background: #0056b3;
+ }
src/chatbot.py ADDED
@@ -0,0 +1,148 @@
+ import os
+ from typing import List
+ from dotenv import load_dotenv
+
+ from langchain_groq import ChatGroq
+ from langchain.schema import HumanMessage, AIMessage
+ from langchain_community.vectorstores import FAISS
+ from langchain_huggingface import HuggingFaceEmbeddings
+ from langchain.prompts import PromptTemplate
+ from langchain.chains import RetrievalQA
+
+ # ---------------------------
+ # Load environment variables
+ # ---------------------------
+ load_dotenv()
+ GROQ_API_KEY = os.getenv("GROQ_API_KEY")
+
+ # ---------------------------
+ # Settings / Tuning
+ # ---------------------------
+ DB_FAISS_PATH = "vectorStore"
+ EMBED_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
+ K = 5                  # how many candidates to check in the pre-filter
+ MAX_DISTANCE = 1.0     # FAISS distance threshold (lower = better)
+ MAX_CHAT_HISTORY = 50  # cap chat history to avoid unbounded growth
+
+ # ---------------------------
+ # Load FAISS VectorStore
+ # ---------------------------
+ embeddings = HuggingFaceEmbeddings(model_name=EMBED_MODEL)
+ db = FAISS.load_local(DB_FAISS_PATH, embeddings, allow_dangerous_deserialization=True)
+ print(f"✅ FAISS index size: {db.index.ntotal}")
+
+ # ---------------------------
+ # ChatBot Class
+ # ---------------------------
+ class RAGChatBot:
+     def __init__(self):
+         # LLM
+         if not GROQ_API_KEY:
+             raise ValueError("GROQ_API_KEY not set in environment")
+         self.llm = ChatGroq(
+             groq_api_key=GROQ_API_KEY,
+             model="llama-3.1-8b-instant",
+             temperature=0
+         )
+         self.chat_history: List = []
+
+         # Retriever used by RetrievalQA (kept, but we pre-filter before calling the chain)
+         self.retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": 3})
+
+         # Custom prompt (dynamic fallback included)
+         custom_prompt = """
+ Use the following context to answer the user's question.
+ If the answer cannot be found in the context, reply exactly with:
+ "I'm trained only on Odisha disaster management reports (i.e., OSDMA, NDMA, IMD, research papers). I don't have any information about: '{question}'"
+
+ Context:
+ {context}
+
+ Question:
+ {question}
+
+ Answer:
+ """
+         self.prompt = PromptTemplate(template=custom_prompt, input_variables=["context", "question"])
+
+         # Retrieval QA chain (keeps structured QA behavior)
+         self.qa_chain = RetrievalQA.from_chain_type(
+             llm=self.llm,
+             retriever=self.retriever,
+             return_source_documents=True,
+             chain_type_kwargs={"prompt": self.prompt}
+         )
+
+     # ---------------------------
+     # Query rewriting
+     # ---------------------------
+     def rewrite_query(self, user_input: str) -> str:
+         """Rewrite the query into formal disaster-management language using the LLM."""
+         rewrite_prompt = f"""
+ Rewrite the following user query into clear, formal disaster management language
+ as used in government reports (OSDMA, NDMA, IMD).
+ If it is not disaster-related, just return it unchanged.
+
+ Query: {user_input}
+ """
+         try:
+             response = self.llm.invoke([HumanMessage(content=rewrite_prompt)])
+             return response.content.strip()
+         except Exception as e:
+             print("⚠ Rewrite error:", e)
+             return user_input  # fall back to the original query
+
+     def _prefilter_by_distance(self, query: str, k: int = K, max_distance: float = MAX_DISTANCE) -> bool:
+         """Check whether the query is in-domain using the best FAISS distance."""
+         results = db.similarity_search_with_score(query, k=k)
+         if not results:
+             return False
+         best_score = results[0][1]  # (Document, score)
+         return best_score <= max_distance
+
+     def chat(self, user_input: str) -> str:
+         # 1) Rewrite the user query
+         rewritten_query = self.rewrite_query(user_input)
+         # print(f"[debug] rewritten query: {rewritten_query}")
+
+         # 2) Quick in-domain pre-filter
+         try:
+             in_domain = self._prefilter_by_distance(rewritten_query)
+         except Exception as e:
+             print("⚠ Prefilter error:", e)
+             in_domain = True
+
+         if not in_domain:
+             return (
+                 f"I'm trained only on Odisha disaster management reports "
+                 f"(OSDMA, NDMA, IMD, research). I don't have any information about: '{user_input}'."
+             )
+
+         # 3) Retrieval + QA
+         try:
+             response = self.qa_chain.invoke({"query": rewritten_query})
+             answer = response.get("result") if isinstance(response, dict) else str(response)
+         except Exception as e:
+             print("⚠ LLM / chain error:", e)
+             answer = "Sorry, I encountered an error while generating the answer."
+
+         # 4) Update memory (bounded)
+         self.chat_history.append(HumanMessage(content=user_input))
+         self.chat_history.append(AIMessage(content=answer))
+         if len(self.chat_history) > MAX_CHAT_HISTORY * 2:
+             self.chat_history = self.chat_history[-MAX_CHAT_HISTORY * 2:]
+
+         return answer
+
+
+ # ---------------------------
+ # Run Chatbot (CLI)
+ # ---------------------------
+ if __name__ == "__main__":
+     bot = RAGChatBot()
+     print("🤖 Odisha Disaster Management ChatBot ready! Type 'exit' to quit.")
+     while True:
+         query = input("You: ")
+         if query.lower() in ["exit", "quit"]:
+             break
+         print("Bot:", bot.chat(query))
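The bounded-history bookkeeping in step 4 of `chat()` can be exercised in isolation. In this sketch plain tuples stand in for LangChain's `HumanMessage`/`AIMessage`, and the cap is shrunk from the real 50 to 3 so the trimming is visible:

```python
MAX_CHAT_HISTORY = 3  # smaller than the real cap of 50, for illustration

def append_turn(history, user_msg, bot_msg, cap=MAX_CHAT_HISTORY):
    """Append one (user, bot) exchange, then keep only the last `cap` turns."""
    history.extend([("human", user_msg), ("ai", bot_msg)])
    # Each turn is two messages, so the list is bounded at cap * 2 entries.
    if len(history) > cap * 2:
        del history[: len(history) - cap * 2]
    return history

h = []
for i in range(5):
    append_turn(h, f"q{i}", f"a{i}")
print(len(h))  # 6, i.e. cap * 2
print(h[0])    # ('human', 'q2'): the two oldest turns were dropped
```

This is the same idea as the `self.chat_history = self.chat_history[-MAX_CHAT_HISTORY * 2:]` slice above: memory never grows past a fixed number of exchanges.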
src/check_index.py ADDED
@@ -0,0 +1,21 @@
+ from langchain_huggingface import HuggingFaceEmbeddings
+ from langchain_community.vectorstores import FAISS
+
+ # Path of the vector store
+ DB_FAISS_PATH = "vectorStore"
+
+ def check_faiss_index():
+     embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
+     db = FAISS.load_local(DB_FAISS_PATH, embeddings, allow_dangerous_deserialization=True)
+
+     # Number of vectors stored in index.faiss
+     num_vectors = db.index.ntotal
+
+     # Number of documents (with metadata) stored in index.pkl
+     num_docs = len(db.docstore._dict)
+
+     print(f"📦 index.faiss contains {num_vectors} vectors")
+     print(f"📑 index.pkl contains {num_docs} metadata entries")
+
+ if __name__ == "__main__":
+     check_faiss_index()
src/ingest.py ADDED
@@ -0,0 +1,86 @@
+ import os
+ import re
+ from dotenv import load_dotenv
+
+ from langchain_core.documents import Document
+ from langchain_community.document_loaders import PyPDFLoader, TextLoader
+ from langchain_text_splitters import RecursiveCharacterTextSplitter
+ from langchain_huggingface import HuggingFaceEmbeddings
+ from langchain_community.vectorstores import FAISS
+
+ # Load environment variables from the .env file
+ load_dotenv()
+
+ # Define the path for the FAISS vector store
+ DB_FAISS_PATH = 'vectorStore'
+
+ def clean_text(text):
+     """Clean messy headers/footers and normalize spacing."""
+     text = re.sub(r'\n\s*\n', '\n\n', text)  # collapse multiple newlines
+     lines = text.split('\n')
+     cleaned_lines = []
+     for line in lines:
+         if sum(c.isalpha() for c in line) > 5:  # keep only lines with more than 5 letters
+             cleaned_lines.append(line)
+     text = '\n'.join(cleaned_lines)
+     text = re.sub(r'\s+', ' ', text).strip()  # normalize spaces
+     return text
+
+ def load_documents():
+     """Load PDF and text documents from the data folder ('../data' relative to src/) with proper encoding."""
+     data_dir = '../data'
+     documents = []
+
+     for root, _, files in os.walk(data_dir):
+         for file in files:
+             file_path = os.path.join(root, file)
+             if file.lower().endswith('.pdf'):
+                 loader = PyPDFLoader(file_path)
+                 print(f"Loading PDF {file_path}")
+                 documents.extend(loader.load())
+             elif file.lower().endswith('.txt'):
+                 print(f"Loading TXT {file_path}")
+                 try:
+                     with open(file_path, 'r', encoding='utf-8') as f:
+                         text = f.read()
+                     documents.append(Document(page_content=text, metadata={"source": file_path}))
+                 except UnicodeDecodeError as e:
+                     print(f"⚠ Skipping {file_path} due to encoding error: {e}")
+             else:
+                 continue
+     return documents
+
+ def create_vector_db():
+     print("Step 1: Loading documents from the data directory...")
+     documents = load_documents()
+
+     if not documents:
+         print("No documents found in the data directory. Exiting.")
+         return
+
+     print(f"Loaded {len(documents)} document(s).")
+
+     print("\nStep 2: Cleaning the text content...")
+     for doc in documents:
+         doc.page_content = clean_text(doc.page_content)
+     print("Text cleaning complete.")
+
+     print("\nStep 3: Splitting into chunks...")
+     text_splitter = RecursiveCharacterTextSplitter(
+         chunk_size=1000,
+         chunk_overlap=100
+     )
+     chunks = text_splitter.split_documents(documents)
+     print(f"Created {len(chunks)} chunks.")
+
+     print("\nStep 4: Creating embeddings with HuggingFace...")
+     embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
+
+     print("Step 5: Building FAISS index...")
+     db = FAISS.from_documents(chunks, embeddings)
+     db.save_local(DB_FAISS_PATH)
+
+     print(f"\n✅ Ingestion complete! Vector store saved at '{DB_FAISS_PATH}'")
+
+ if __name__ == "__main__":
+     create_vector_db()
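The `clean_text` heuristic above (drop near-empty header/footer lines, then collapse whitespace) is easy to check in isolation. Here is a copy of the function run against a hypothetical noisy PDF page:

```python
import re

def clean_text(text):
    """Drop lines with 5 or fewer letters and normalize whitespace (as in ingest.py)."""
    text = re.sub(r'\n\s*\n', '\n\n', text)  # collapse multiple newlines
    lines = text.split('\n')
    cleaned = [line for line in lines if sum(c.isalpha() for c in line) > 5]
    return re.sub(r'\s+', ' ', '\n'.join(cleaned)).strip()

page = "Page 12\n\nOdisha State Disaster Management Authority\n---\nAnnual   flood report"
print(clean_text(page))
# → Odisha State Disaster Management Authority Annual flood report
```

"Page 12" (only 4 letters) and the "---" rule are discarded as layout noise, while the substantive lines survive with normalized spacing.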
src/query.py ADDED
@@ -0,0 +1,33 @@
+ from langchain_community.vectorstores import FAISS
+ from langchain_huggingface import HuggingFaceEmbeddings
+
+ # Load embeddings and the vector DB
+ embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
+ db = FAISS.load_local("vectorStore", embeddings, allow_dangerous_deserialization=True)
+
+ # Ask a query
+ query = "how many cyclones struck odisha in 2015"
+ results = db.similarity_search_with_score(query, k=5)
+
+ # Apply a similarity threshold (FAISS returns L2 distances: lower = closer)
+ THRESHOLD = 0.75
+
+ filtered = []
+ for doc, score in results:
+     print(f"🔎 Retrieved (distance={score:.4f}): {doc.metadata}")  # debug
+     if score <= THRESHOLD:  # keep only results closer than the threshold
+         filtered.append(doc)
+
+ if not filtered:
+     answer = "I don't know. This information is not available in my knowledge base."
+ else:
+     answer = "\n\n".join([doc.page_content for doc in filtered])
+
+ print(f"\n🔍 Query: {query}")
+ print(f"\n📝 Answer:\n{answer}")
+
+ # Inspect each result along with its metadata:
+ # for i, (doc, score) in enumerate(results, 1):
+ #     print(f"\n--- Result {i} ---")
+ #     print(f"Content: {doc.page_content[:300]}...")
+ #     print(f"Source: {doc.metadata}")
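The thresholding loop in this script reduces to a small pure function, shown here with hypothetical chunk texts and distances in place of real FAISS `(Document, score)` results:

```python
# Distance-threshold filtering as in query.py: keep only chunks whose
# FAISS L2 distance is at or below THRESHOLD (lower distance = closer match).
THRESHOLD = 0.75  # same knob as in the script above

def answer_from_results(results, threshold=THRESHOLD):
    """results: list of (chunk_text, distance) pairs from a similarity search."""
    kept = [text for text, score in results if score <= threshold]
    if not kept:
        return "I don't know. This information is not available in my knowledge base."
    return "\n\n".join(kept)

demo = [("Cyclone chunk", 0.41), ("Flood chunk", 0.69), ("Unrelated chunk", 1.32)]
print(answer_from_results(demo))  # joins the first two chunks; drops the third
```

Tuning `THRESHOLD` trades recall for precision: a lower value answers less often but with tighter grounding in the retrieved reports.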