metadata
title: SCDM Chatbot App
emoji: 🚀
colorFrom: indigo
colorTo: pink
sdk: streamlit
sdk_version: 1.36.0
app_file: app.py
pinned: false
license: mit
SCDM Chatbot (Streamlit + LangChain + Groq)
ChatGPT-like assistant for SCDM content. It answers questions, summarizes, and generates quizzes over PDFs in data/pdf/, always showing clear, human-readable sources (document title, page, and a clickable link from data/source_links.json).
Features
- Q&A with retrieval-augmented generation (RAG) and readable citations
- Summarization (single or multi-document context)
- Quiz generation (MCQs with answers, explanations, and citations)
- “Auto” intent routing (classifies input to Q&A / Summarize / Quiz)
- Clean source display: full paragraph block quotes, with title + page + link
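As an illustration of the "Auto" routing idea, here is a minimal sketch that maps a message to one of the three modes. It is a keyword heuristic swapped in for illustration only; the README does not say how app.py classifies intent, and the real app may well use an LLM call. The name `route_intent` and the patterns are hypothetical.

```python
import re

# Hypothetical keyword patterns, checked in order; not taken from app.py.
_INTENT_PATTERNS = {
    "quiz": re.compile(r"\b(quiz|mcqs?|multiple[- ]choice)\b", re.IGNORECASE),
    "summarize": re.compile(r"\b(summar\w*|overview)\b", re.IGNORECASE),
}


def route_intent(user_input: str) -> str:
    """Return 'quiz', 'summarize', or 'qa' (the fallback) for a user message."""
    for intent, pattern in _INTENT_PATTERNS.items():
        if pattern.search(user_input):
            return intent
    return "qa"


if __name__ == "__main__":
    print(route_intent("Create a 5-question quiz on RBQM with citations."))
```

Falling back to Q&A keeps unclassified input on the most general path, which is one reasonable default for a router like this.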
Requirements
- Python 3.10–3.12 recommended
- A Groq API key (GROQ_API_KEY)
- macOS/Linux/Windows (CPU only; no GPU required)
Quickstart
- Create a virtual environment
python3 -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
- Install dependencies
pip install --upgrade pip
pip install -r requirements.txt
- Configure environment
cp .env.example .env
# Edit .env and set: GROQ_API_KEY=your_key_here
- Build the index (extracts paragraphs with page metadata and embeds them)
python ingest.py
- Run the app
streamlit run app.py
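The two most common setup gaps in the steps above (a missing API key and a missing index) can be checked before launching. This is a minimal sketch, not part of the project: the function name `preflight` is hypothetical, and the index directory path is an assumption to adjust to wherever ingest.py writes its output. Note that values in .env only appear in the process environment if something loads them (for example python-dotenv) or you export them in your shell.

```python
import os
from pathlib import Path


def preflight(env: dict, index_dir: Path) -> list[str]:
    """Return a list of human-readable setup problems (empty if none)."""
    problems = []
    if not env.get("GROQ_API_KEY"):
        problems.append("GROQ_API_KEY is not set: configure .env or export the variable")
    # Treat a missing or empty directory as "no index built yet".
    if not index_dir.is_dir() or not any(index_dir.iterdir()):
        problems.append("Vector index not found: run `python ingest.py`")
    return problems


if __name__ == "__main__":
    # "data/index" is an assumed location, not confirmed by the README.
    for problem in preflight(dict(os.environ), Path("data/index")):
        print(problem)
```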
Usage
- Select a model in the sidebar (default: llama-3.3-70b-versatile; also available: llama-3.1-8b-instant).
- Choose a mode: Auto, Q&A, Summarize, or Quiz. Auto attempts to classify your intent.
- Ask things like:
- “Tell me about CDM to CDS”
- “Summarize the key QbD responsibilities for CDS and cite sources.”
- “Create a 5-question quiz on RBQM with citations.”
- Sources appear below each answer as expanders with:
- Document title and page number
- Clickable URL like ...pdf#page=10
- Full paragraph block quotes for readability
Adding/Updating Documents
- Place PDFs in data/pdf/.
- Add/update entries in data/source_links.json with the PDF file name → public link mapping.
- Rebuild the index:
python ingest.py
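Before rebuilding, it can help to confirm that every PDF has a link entry. The sketch below assumes data/source_links.json is a flat JSON object of the form `{"file.pdf": "https://..."}`; the README describes a file name → link mapping but does not show the exact schema, so treat that shape and the helper name `missing_links` as assumptions.

```python
import json
from pathlib import Path


def missing_links(pdf_dir: Path, links_file: Path) -> list[str]:
    """Return the names of PDFs in pdf_dir that have no entry in links_file."""
    # Assumed schema: a flat object mapping PDF file name -> public URL.
    links = json.loads(links_file.read_text(encoding="utf-8"))
    return sorted(p.name for p in pdf_dir.glob("*.pdf") if p.name not in links)


if __name__ == "__main__":
    pdf_dir, links_file = Path("data/pdf"), Path("data/source_links.json")
    if pdf_dir.is_dir() and links_file.is_file():
        for name in missing_links(pdf_dir, links_file):
            print(f"No link entry for {name}")
```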
Project Structure
scdm_chatbot/
  app.py                 # Streamlit UI and chains (Q&A, Summarize, Quiz)
  ingest.py              # PDF → paragraph extraction → FAISS index
  requirements.txt       # Python dependencies
  .env.example           # Env var template (GROQ_API_KEY)
  data/
    pdf/                 # Input PDFs
    source_links.json    # File name → source URL mapping
    index/               # Generated FAISS index and manifest
  user_requirements.txt  # Problem statement and expected use cases
Troubleshooting
- Groq error mentioning reasoning_format or Completions.create: update packages
pip install --upgrade groq langchain-groq langchain
- Vector index not found: run ingestion
python ingest.py
- GROQ_API_KEY is not set: configure .env or export the variable
export GROQ_API_KEY=your_key_here
- PDF parsing issues: ensure files are valid PDFs; the app uses PyMuPDF to extract text and split into paragraphs with page numbers.
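When debugging a PDF that parses badly, it can be useful to look at what PyMuPDF returns per page. The following is a rough sketch of paragraph extraction with page numbers, not the code in ingest.py: it treats each PyMuPDF text block as one paragraph, which is an assumption about how paragraphs are detected. The helper names are hypothetical.

```python
def blocks_to_records(blocks, page_number: int) -> list[dict]:
    """Turn PyMuPDF "blocks" tuples into {"page", "text"} records.

    Each tuple is (x0, y0, x1, y1, text, block_no, block_type);
    block_type 0 is text, 1 is an image.
    """
    records = []
    for block in blocks:
        text, block_type = block[4], block[6]
        if block_type != 0:
            continue  # skip image blocks
        paragraph = " ".join(text.split())  # collapse line breaks and extra spaces
        if paragraph:
            records.append({"page": page_number, "text": paragraph})
    return records


def extract_paragraphs(pdf_path: str) -> list[dict]:
    """Extract one record per text block, with 1-based page numbers."""
    import fitz  # PyMuPDF; imported here so the helper above works without it

    records = []
    with fitz.open(pdf_path) as doc:
        for page_number, page in enumerate(doc, start=1):
            records.extend(blocks_to_records(page.get_text("blocks"), page_number))
    return records
```

If `extract_paragraphs` returns an empty list for a file, the PDF is probably scanned images without a text layer, and text extraction alone will not recover its content.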
Notes on Citations
- The app displays sources as human-readable cards with full paragraphs to avoid broken chunks.
- Citations include title, page (e.g., “(Title, p. 10)”), and a clickable link derived from data/source_links.json.
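The citation format described above can be sketched as a small helper. This is an illustration under stated assumptions, not the app's code: `format_citation` is a hypothetical name, and `links` is assumed to be the file name → URL mapping loaded from data/source_links.json as a flat dictionary.

```python
def format_citation(title: str, page: int, file_name: str, links: dict[str, str]) -> str:
    """Build "(Title, p. N)" plus a page-anchored link when one is known."""
    label = f"({title}, p. {page})"
    base_url = links.get(file_name)
    if base_url is None:
        return label  # no public link registered for this file
    # The #page=N fragment asks PDF viewers to open at that page.
    return f"{label} {base_url}#page={page}"


if __name__ == "__main__":
    example_links = {"guide.pdf": "https://example.com/guide.pdf"}  # placeholder data
    print(format_citation("Guide", 10, "guide.pdf", example_links))
```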
Commands Cheat Sheet
# Setup
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env # set GROQ_API_KEY
# Index and run
python ingest.py
streamlit run app.py
# Update core libs if needed
pip install --upgrade groq langchain-groq langchain