---
title: SCDM Chatbot App
emoji: 🚀
colorFrom: indigo
colorTo: pink
sdk: streamlit
sdk_version: 1.36.0
app_file: app.py
pinned: false
license: mit
---

## SCDM Chatbot (Streamlit + LangChain + Groq)

ChatGPT-like assistant for SCDM content. It answers questions, summarizes, and generates quizzes over PDFs in `data/pdf/`, always showing clear, human-readable sources (document title, page, and a clickable link from `data/source_links.json`).

### Features

- Q&A with retrieval-augmented generation (RAG) and readable citations
- Summarization (single or multi-document context)
- Quiz generation (MCQs with answers, explanations, and citations)
- “Auto” intent routing (classifies input to Q&A / Summarize / Quiz)
- Clean source display: full paragraph block quotes, with title + page + link

### Requirements

- Python 3.10–3.12 recommended
- A Groq API key (`GROQ_API_KEY`)
- macOS/Linux/Windows (CPU only; no GPU required)

### Quickstart

1) Create a virtual environment

```bash
python3 -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
```

2) Install dependencies

```bash
pip install --upgrade pip
pip install -r requirements.txt
```

3) Configure environment

```bash
cp .env.example .env
# Edit .env and set: GROQ_API_KEY=your_key_here
```

4) Build the index (extracts paragraphs with page metadata and embeds them)

```bash
python ingest.py
```

5) Run the app

```bash
streamlit run app.py
```

### Usage

- Select a model in the sidebar (default: `llama-3.3-70b-versatile`; also available: `llama-3.1-8b-instant`).
- Choose a mode: Auto, Q&A, Summarize, or Quiz. Auto attempts to classify your intent.
- Ask things like:
  - “Tell me about CDM to CDS”
  - “Summarize the key QbD responsibilities for CDS and cite sources.”
  - “Create a 5-question quiz on RBQM with citations.”
- Sources appear below each answer as expanders with:
  - Document title and page number
  - Clickable URL like `...pdf#page=10`
  - Full paragraph block quotes for readability

### Adding/Updating Documents

1) Place PDFs in `data/pdf/`.
2) Add/update entries in `data/source_links.json` with the PDF file name → public link mapping.
3) Rebuild the index:

```bash
python ingest.py
```

### Project Structure

```
scdm_chatbot/
  app.py                  # Streamlit UI and chains (Q&A, Summarize, Quiz)
  ingest.py               # PDF → paragraph extraction → FAISS index
  requirements.txt        # Python dependencies
  .env.example            # Env var template (GROQ_API_KEY)
  data/
    pdf/                  # Input PDFs
    source_links.json     # File name → source URL mapping
    index/                # Generated FAISS index and manifest
  user_requirements.txt   # Problem statement and expected use cases
```

### Troubleshooting

- Groq error mentioning `reasoning_format` or `Completions.create`: update packages

  ```bash
  pip install --upgrade groq langchain-groq langchain
  ```

- `Vector index not found`: run ingestion

  ```bash
  python ingest.py
  ```

- `GROQ_API_KEY is not set`: configure `.env` or export the variable

  ```bash
  export GROQ_API_KEY=your_key_here
  ```

- PDF parsing issues: ensure files are valid PDFs; the app uses PyMuPDF to extract text and split into paragraphs with page numbers.

### Notes on Citations

- The app displays sources as human-readable cards with full paragraphs to avoid broken chunks.
- Citations include title, page (e.g., “(Title, p. 10)”), and a clickable link derived from `data/source_links.json`.

### Commands Cheat Sheet

```bash
# Setup
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env  # set GROQ_API_KEY

# Index and run
python ingest.py
streamlit run app.py

# Update core libs if needed
pip install --upgrade groq langchain-groq langchain
```
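
### Example: Building a Citation Link

The citation format described in this README (a “(Title, p. 10)” label plus a URL ending in `#page=10`) can be sketched in a few lines of Python. This is only an illustration, not the code in `app.py`. It assumes `data/source_links.json` is a flat JSON object mapping each PDF file name to its public URL, and the helper names `load_source_links` and `build_citation` are hypothetical.

```python
import json
from pathlib import Path


def load_source_links(path="data/source_links.json"):
    """Load the file name -> public URL mapping (assumed to be a flat JSON object).

    Returns an empty dict when the file does not exist.
    """
    p = Path(path)
    if not p.exists():
        return {}
    with p.open(encoding="utf-8") as f:
        return json.load(f)


def build_citation(file_name, title, page, links):
    """Return (label, url) for one source.

    label looks like "(Title, p. 10)"; url is the mapped link with a
    "#page=N" fragment, or None when the file name has no mapping.
    """
    label = f"({title}, p. {page})"
    base = links.get(file_name)
    if not base:
        return label, None
    # Drop any fragment already present so only one "#page=" is appended.
    base = base.split("#", 1)[0]
    return label, f"{base}#page={page}"


if __name__ == "__main__":
    # Hypothetical mapping used only for demonstration.
    demo_links = {"example.pdf": "https://example.com/docs/example.pdf"}
    print(build_citation("example.pdf", "Example Title", 10, demo_links))
```

If a PDF has no entry in the mapping, this sketch still returns the readable label and leaves the URL as `None`, so a caller can show the citation without a link.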