---
title: SCDM Chatbot App
emoji: 🚀
colorFrom: indigo
colorTo: pink
sdk: streamlit
sdk_version: 1.36.0
app_file: app.py
pinned: false
license: mit
---
## SCDM Chatbot (Streamlit + LangChain + Groq)
A ChatGPT-like assistant for SCDM content. It answers questions, summarizes, and generates quizzes over the PDFs in `data/pdf/`, and always shows clear, human-readable sources (document title, page number, and a clickable link from `data/source_links.json`).
### Features
- Q&A with retrieval-augmented generation (RAG) and readable citations
- Summarization (single or multi-document context)
- Quiz generation (MCQs with answers, explanations, and citations)
- “Auto” intent routing (classifies input to Q&A / Summarize / Quiz)
- Clean source display: full paragraph block quotes, with title + page + link
### Requirements
- Python 3.10–3.12 recommended
- A Groq API key (`GROQ_API_KEY`)
- macOS/Linux/Windows (CPU only; no GPU required)
### Quickstart
1) Create a virtual environment
```bash
python3 -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
```
2) Install dependencies
```bash
pip install --upgrade pip
pip install -r requirements.txt
```
3) Configure environment
```bash
cp .env.example .env
# Edit .env and set: GROQ_API_KEY=your_key_here
```
4) Build the index (extracts paragraphs with page metadata and embeds them)
```bash
python ingest.py
```
5) Run the app
```bash
streamlit run app.py
```
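Steps 3 to 5 rely on `GROQ_API_KEY` being visible to the Python process. The sketch below is one way to check that before launching; the helper name `missing_env_vars` is hypothetical and not part of this repository. It inspects only the process environment, so it assumes the key has been exported or already loaded from `.env`.

```python
import os


def missing_env_vars(required=("GROQ_API_KEY",), env=None):
    """Return the names in `required` that are unset or blank in `env`."""
    # Default to the real process environment; a plain dict can be passed for testing.
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


print(missing_env_vars(env={"GROQ_API_KEY": "your_key_here"}))  # []
print(missing_env_vars(env={}))  # ['GROQ_API_KEY']
```

Calling `missing_env_vars()` with no arguments checks the current shell environment.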
### Usage
- Select a model in the sidebar (default: `llama-3.3-70b-versatile`; also available: `llama-3.1-8b-instant`).
- Choose a mode: Auto, Q&A, Summarize, or Quiz. Auto attempts to classify your intent.
- Ask things like:
  - “Tell me about CDM to CDS”
  - “Summarize the key QbD responsibilities for CDS and cite sources.”
  - “Create a 5-question quiz on RBQM with citations.”
- Sources appear below each answer as expanders with:
  - Document title and page number
  - Clickable URL like `...pdf#page=10`
  - Full paragraph block quotes for readability
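To illustrate what Auto mode has to decide, here is a minimal keyword-based router. This is a stand-in, not the classifier used in `app.py` (which this README does not describe and which may well be LLM-based); the function name `route_intent` and the keyword lists are assumptions made for the example.

```python
def route_intent(prompt):
    """Return 'quiz', 'summarize', or 'qa' using simple keyword matching."""
    text = prompt.lower()
    # Hypothetical keyword lists; the real app may classify intent differently.
    if any(word in text for word in ("quiz", "mcq", "test me")):
        return "quiz"
    if any(word in text for word in ("summarize", "summarise", "summary")):
        return "summarize"
    # Anything that matches neither list is treated as a question.
    return "qa"


for example in (
    "Tell me about CDM to CDS",
    "Summarize the key QbD responsibilities for CDS and cite sources.",
    "Create a 5-question quiz on RBQM with citations.",
):
    print(route_intent(example), "<-", example)
```

With the three sample prompts above, this heuristic picks `qa`, `summarize`, and `quiz` respectively. Selecting a mode explicitly in the sidebar bypasses any such classification.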
### Adding/Updating Documents
1) Place PDFs in `data/pdf/`.
2) Add/update entries in `data/source_links.json` with the PDF file name → public link mapping.
3) Rebuild the index:
```bash
python ingest.py
```
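Before rebuilding, it may be worth confirming that every PDF has a link entry. The sketch below assumes `data/source_links.json` is a flat JSON object mapping each PDF file name to its URL; the exact schema is not documented here, and the helper `find_unlinked_pdfs` is hypothetical.

```python
import json
from pathlib import Path


def find_unlinked_pdfs(pdf_dir, links_file):
    """Return PDF file names in pdf_dir that have no entry in links_file."""
    # Assumption: the JSON file looks like {"some_file.pdf": "https://..."}.
    links = json.loads(Path(links_file).read_text(encoding="utf-8"))
    pdf_names = sorted(p.name for p in Path(pdf_dir).glob("*.pdf"))
    return [name for name in pdf_names if name not in links]
```

For example, `find_unlinked_pdfs("data/pdf", "data/source_links.json")` returns a list of file names that still need a link, or an empty list when the mapping is complete.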
### Project Structure
```
scdm_chatbot/
  app.py                 # Streamlit UI and chains (Q&A, Summarize, Quiz)
  ingest.py              # PDF → paragraph extraction → FAISS index
  requirements.txt       # Python dependencies
  .env.example           # Env var template (GROQ_API_KEY)
  data/
    pdf/                 # Input PDFs
    source_links.json    # File name → source URL mapping
    index/               # Generated FAISS index and manifest
  user_requirements.txt  # Problem statement and expected use cases
```
### Troubleshooting
- Groq error mentioning `reasoning_format` or `Completions.create`: update packages
```bash
pip install --upgrade groq langchain-groq langchain
```
- `Vector index not found`: run ingestion
```bash
python ingest.py
```
- `GROQ_API_KEY is not set`: configure `.env` or export the variable
```bash
export GROQ_API_KEY=your_key_here
```
- PDF parsing issues: ensure files are valid PDFs; the app uses PyMuPDF to extract text and split into paragraphs with page numbers.
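When debugging parsing problems, it can help to see what paragraph splitting with page numbers looks like in isolation. The sketch below is not the code in `ingest.py`; it assumes the per-page text has already been extracted (with PyMuPDF that would typically come from `page.get_text()`) and that blank lines separate paragraphs. The name `split_paragraphs` is hypothetical.

```python
import re


def split_paragraphs(page_texts):
    """Yield (page_number, paragraph) pairs from a list of per-page text strings."""
    for page_number, text in enumerate(page_texts, start=1):
        # Assumption: one or more blank lines mark a paragraph boundary.
        for block in re.split(r"\n\s*\n", text):
            # Collapse line wraps and repeated whitespace into single spaces.
            paragraph = " ".join(block.split())
            if paragraph:
                yield page_number, paragraph


pages = ["First paragraph,\nwrapped.\n\nSecond paragraph.", "\n\nThird, on page two."]
print(list(split_paragraphs(pages)))
# [(1, 'First paragraph, wrapped.'), (1, 'Second paragraph.'), (2, 'Third, on page two.')]
```

If a PDF yields no paragraphs at all under a check like this, it is likely a scanned document without a text layer.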
### Notes on Citations
- The app displays sources as human-readable cards that quote full paragraphs rather than fragmentary chunks.
- Citations include title, page (e.g., “(Title, p. 10)”), and a clickable link derived from `data/source_links.json`.
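The citation format described above can be sketched as follows. The function `format_citation` and the example title and URL are placeholders, not names from `app.py`; the `#page=N` fragment is the standard PDF open parameter that most browser viewers honor.

```python
def format_citation(title, page, base_url=None):
    """Build a '(Title, p. N)' label and, when a URL is known, a page-anchored link."""
    label = f"({title}, p. {page})"
    # Drop any existing fragment before appending the page anchor.
    link = f"{base_url.split('#', 1)[0]}#page={page}" if base_url else None
    return label, link


label, link = format_citation("Example Title", 10, "https://example.org/docs/example.pdf")
print(label)  # (Example Title, p. 10)
print(link)   # https://example.org/docs/example.pdf#page=10
```

When a file has no entry in `data/source_links.json`, this sketch returns `None` for the link so the caller can show the label alone.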
### Commands Cheat Sheet
```bash
# Setup
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env # set GROQ_API_KEY
# Index and run
python ingest.py
streamlit run app.py
# Update core libs if needed
pip install --upgrade groq langchain-groq langchain
```