title: SCDM Chatbot App
emoji: 🚀
colorFrom: indigo
colorTo: pink
sdk: streamlit
sdk_version: 1.36.0
app_file: app.py
pinned: false
license: mit

SCDM Chatbot (Streamlit + LangChain + Groq)

A ChatGPT-like assistant for SCDM content. It answers questions, summarizes documents, and generates quizzes from the PDFs in data/pdf/, and it always shows clear, human-readable sources (document title, page number, and a clickable link from data/source_links.json).

Features

  • Q&A with retrieval-augmented generation (RAG) and readable citations
  • Summarization (single or multi-document context)
  • Quiz generation (MCQs with answers, explanations, and citations)
  • “Auto” intent routing (classifies input to Q&A / Summarize / Quiz)
  • Clean source display: full paragraph block quotes, with title + page + link
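To illustrate what "Auto" intent routing means, here is a minimal keyword-based sketch. It is not the app's actual classifier, which may well use the LLM itself; the pattern table and the `route_intent` name are hypothetical and exist only for this example.

```python
import re

# Hypothetical keyword rules for illustration only; the real app may
# classify intent differently (for example, by prompting the LLM).
INTENT_PATTERNS = {
    "quiz": re.compile(r"\b(quiz|mcq|multiple[- ]choice)\b", re.IGNORECASE),
    "summarize": re.compile(r"\b(summari[sz]e|summary|overview)\b", re.IGNORECASE),
}


def route_intent(user_input: str) -> str:
    """Return 'quiz', 'summarize', or 'qa' (the fallback) for a user message."""
    for intent, pattern in INTENT_PATTERNS.items():
        if pattern.search(user_input):
            return intent
    return "qa"


if __name__ == "__main__":
    for prompt in (
        "Tell me about CDM to CDS",
        "Summarize the key QbD responsibilities for CDS and cite sources.",
        "Create a 5-question quiz on RBQM with citations.",
    ):
        print(f"{route_intent(prompt):10s} <- {prompt}")
```

Falling back to Q&A keeps ambiguous inputs on the safest path: a retrieval-backed answer with citations.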

Requirements

  • Python 3.10–3.12 recommended
  • A Groq API key (GROQ_API_KEY)
  • macOS/Linux/Windows (CPU only; no GPU required)

Quickstart

  1. Create a virtual environment
python3 -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
  2. Install dependencies
pip install --upgrade pip
pip install -r requirements.txt
  3. Configure environment
cp .env.example .env
# Edit .env and set: GROQ_API_KEY=your_key_here
  4. Build the index (extracts paragraphs with page metadata and embeds them)
python ingest.py
  5. Run the app
streamlit run app.py
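If the app complains about a missing key after the environment step, a quick check like the sketch below can help. It inspects only the current process environment; a value that lives solely in .env is visible only if something loads that file (python-dotenv is a common choice, but whether app.py uses it is an assumption). The `groq_key_configured` helper is hypothetical and not part of the project.

```python
import os
from typing import Mapping


def groq_key_configured(env: Mapping[str, str] = os.environ) -> bool:
    """True if GROQ_API_KEY is set and is neither blank nor the template placeholder."""
    value = env.get("GROQ_API_KEY", "").strip()
    return bool(value) and value != "your_key_here"


if __name__ == "__main__":
    print("GROQ_API_KEY configured:", groq_key_configured())
```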

Usage

  • Select a model in the sidebar (default: llama-3.3-70b-versatile; also available: llama-3.1-8b-instant).
  • Choose a mode: Auto, Q&A, Summarize, or Quiz. Auto attempts to classify your intent.
  • Ask things like:
    • “Tell me about CDM to CDS”
    • “Summarize the key QbD responsibilities for CDS and cite sources.”
    • “Create a 5-question quiz on RBQM with citations.”
  • Sources appear below each answer as expanders with:
    • Document title and page number
    • Clickable URL like ...pdf#page=10
    • Full paragraph block quotes for readability

Adding/Updating Documents

  1. Place PDFs in data/pdf/.
  2. Add/update entries in data/source_links.json with the PDF file name → public link mapping.
  3. Rebuild the index:
python ingest.py
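Before rebuilding, it can be worth confirming that every PDF has a link entry. The sketch below assumes data/source_links.json is a flat JSON object of the form {"file.pdf": "https://..."}; the actual schema may differ, and both helper names are hypothetical.

```python
import json
from pathlib import Path


def find_unlinked_pdfs(pdf_names, links):
    """Return the PDF file names that have no entry in the links mapping, sorted."""
    return sorted(name for name in pdf_names if name not in links)


def check_source_links(pdf_dir="data/pdf", links_path="data/source_links.json"):
    """Compare the PDFs on disk with the keys of the links file (assumed flat)."""
    links = json.loads(Path(links_path).read_text(encoding="utf-8"))
    pdf_names = [path.name for path in Path(pdf_dir).glob("*.pdf")]
    return find_unlinked_pdfs(pdf_names, links)


if __name__ == "__main__":
    missing = check_source_links()
    print("PDFs without a source link:", missing or "none")
```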

Project Structure

scdm_chatbot/
  app.py                  # Streamlit UI and chains (Q&A, Summarize, Quiz)
  ingest.py               # PDF → paragraph extraction → FAISS index
  requirements.txt        # Python dependencies
  .env.example            # Env var template (GROQ_API_KEY)
  data/
    pdf/                  # Input PDFs
    source_links.json     # File name → source URL mapping
    index/                # Generated FAISS index and manifest
  user_requirements.txt   # Problem statement and expected use cases

Troubleshooting

  • Groq error mentioning reasoning_format or Completions.create: update packages
pip install --upgrade groq langchain-groq langchain
  • Vector index not found: run ingestion
python ingest.py
  • GROQ_API_KEY is not set: configure .env or export the variable in your shell
export GROQ_API_KEY=your_key_here  # Windows PowerShell: $env:GROQ_API_KEY="your_key_here"
  • PDF parsing issues: ensure files are valid PDFs; the app uses PyMuPDF to extract text and split into paragraphs with page numbers.
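When debugging a PDF that parses badly, a small standalone script can show what text PyMuPDF actually returns. The sketch below is not the project's ingest.py: splitting on blank lines and treating each text block as a paragraph are assumptions, and both function names are hypothetical.

```python
import re


def split_paragraphs(page_texts):
    """Return (page_number, paragraph) pairs; pages are numbered from 1."""
    records = []
    for page_number, text in enumerate(page_texts, start=1):
        # Blank lines separate paragraphs; wrapped lines are joined with spaces.
        for block in re.split(r"\n\s*\n", text):
            paragraph = " ".join(block.split())
            if paragraph:
                records.append((page_number, paragraph))
    return records


def extract_paragraphs(pdf_path):
    """Read a PDF with PyMuPDF and return (page_number, paragraph) pairs."""
    import fitz  # PyMuPDF; imported lazily so split_paragraphs works without it

    with fitz.open(pdf_path) as doc:
        # get_text("blocks") yields tuples whose item 4 is the text and
        # item 6 is the block type (0 = text, 1 = image).
        pages = [
            "\n\n".join(b[4] for b in page.get_text("blocks") if b[6] == 0)
            for page in doc
        ]
    return split_paragraphs(pages)
```

If `extract_paragraphs` returns an empty list for a file, the PDF is likely scanned images without a text layer, which text extraction alone cannot handle.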

Notes on Citations

  • The app displays sources as human-readable cards with full paragraphs to avoid broken chunks.
  • Citations include title, page (e.g., “(Title, p. 10)”), and a clickable link derived from data/source_links.json.
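As a rough sketch of how such a citation could be assembled, the helpers below build the "(Title, p. 10)" label and a link with a #page fragment. The function names and the example.com URL are hypothetical; the app's own formatting code may differ.

```python
def format_citation(title: str, page: int) -> str:
    """Build the short citation label, e.g. '(Title, p. 10)'."""
    return f"({title}, p. {page})"


def page_link(base_url: str, page: int) -> str:
    """Append a #page=N fragment, replacing any fragment already on the URL."""
    return f"{base_url.split('#', 1)[0]}#page={page}"


if __name__ == "__main__":
    # Placeholder URL; real links come from data/source_links.json.
    url = "https://example.com/docs/guide.pdf"
    print(format_citation("Guide", 10), page_link(url, 10))
```

Most browser PDF viewers honor the #page fragment, which is what lets a citation jump straight to the quoted page.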

Commands Cheat Sheet

# Setup
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env  # set GROQ_API_KEY

# Index and run
python ingest.py
streamlit run app.py

# Update core libs if needed
pip install --upgrade groq langchain-groq langchain