
title: SCDM Chatbot App
emoji: 🚀
colorFrom: indigo
colorTo: pink
sdk: streamlit
sdk_version: 1.36.0
app_file: app.py
pinned: false
license: mit

SCDM Chatbot (Streamlit + LangChain + Groq)

A ChatGPT-like assistant for SCDM content. It answers questions, summarizes documents, and generates quizzes from the PDFs in data/pdf/, and always shows clear, human-readable sources (document title, page number, and a clickable link taken from data/source_links.json).

Features

Q&A with retrieval-augmented generation (RAG) and readable citations
Summarization (single or multi-document context)
Quiz generation (MCQs with answers, explanations, and citations)
“Auto” intent routing (classifies input to Q&A / Summarize / Quiz)
Clean source display: full paragraph block quotes, with title + page + link

Requirements

Python 3.10–3.12 recommended
A Groq API key (GROQ_API_KEY)
macOS/Linux/Windows (CPU only; no GPU required)

Quickstart

Create a virtual environment

python3 -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

Install dependencies

pip install --upgrade pip
pip install -r requirements.txt

Configure environment

cp .env.example .env
# Edit .env and set: GROQ_API_KEY=your_key_here

Build the index (extracts paragraphs with page metadata and embeds them)

python ingest.py

Run the app

streamlit run app.py

Usage

Select a model in the sidebar (default: llama-3.3-70b-versatile; also available: llama-3.1-8b-instant).
Choose a mode: Auto, Q&A, Summarize, or Quiz. Auto attempts to classify your intent.
Ask things like:
- “Tell me about CDM to CDS”
- “Summarize the key QbD responsibilities for CDS and cite sources.”
- “Create a 5-question quiz on RBQM with citations.”
Sources appear below each answer as expanders with:
- Document title and page number
- Clickable URL like ...pdf#page=10
- Full paragraph block quotes for readability
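
To illustrate what Auto mode has to decide, here is a minimal keyword-based sketch of intent routing. It is a stand-in, not the app's actual classifier (which may well use an LLM call); the function name `route_intent`, the patterns, and the labels `"quiz"`, `"summarize"`, and `"qa"` are all hypothetical.

```python
import re

# Hypothetical keyword patterns; the real app may classify intent differently.
QUIZ_PATTERN = re.compile(r"\b(quiz\w*|mcqs?|multiple[- ]choice)\b", re.IGNORECASE)
SUMMARY_PATTERN = re.compile(r"\b(summar\w*|overview|key points)\b", re.IGNORECASE)


def route_intent(user_input: str) -> str:
    """Map a user message to one of three modes: 'quiz', 'summarize', or 'qa'."""
    if QUIZ_PATTERN.search(user_input):
        return "quiz"
    if SUMMARY_PATTERN.search(user_input):
        return "summarize"
    # Anything that is not clearly a quiz or summary request falls back to Q&A.
    return "qa"
```

With these patterns, the three example prompts above route to Q&A, Summarize, and Quiz respectively. A keyword router is cheap and predictable, but it misses paraphrases, which is one reason to let users pick the mode explicitly.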

Adding/Updating Documents

Place PDFs in data/pdf/.
Add/update entries in data/source_links.json with the PDF file name → public link mapping.
Rebuild the index:

python ingest.py
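
Before rebuilding, it can help to confirm that every PDF has a link. The sketch below assumes data/source_links.json is a flat JSON object of the form {"file name": "URL"}; the README does not spell out the exact schema, and the helper names `find_unlinked_pdfs` and `check_source_links` are hypothetical.

```python
import json
from pathlib import Path


def find_unlinked_pdfs(pdf_names, links):
    """Return the PDF file names that have no non-empty URL in the mapping."""
    return sorted(name for name in pdf_names if not links.get(name))


def check_source_links(pdf_dir="data/pdf", links_file="data/source_links.json"):
    """Load the mapping from disk and report PDFs that still need a link.

    Assumes the JSON file is a flat {file name: URL} object.
    """
    links = json.loads(Path(links_file).read_text(encoding="utf-8"))
    pdf_names = [path.name for path in Path(pdf_dir).glob("*.pdf")]
    return find_unlinked_pdfs(pdf_names, links)
```

Calling `check_source_links()` from the project root returns an empty list when every PDF in data/pdf/ has a link.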

Project Structure

scdm_chatbot/
  app.py                  # Streamlit UI and chains (Q&A, Summarize, Quiz)
  ingest.py               # PDF → paragraph extraction → FAISS index
  requirements.txt        # Python dependencies
  .env.example            # Env var template (GROQ_API_KEY)
  data/
    pdf/                  # Input PDFs
    source_links.json     # File name → source URL mapping
    index/                # Generated FAISS index and manifest
  user_requirements.txt   # Problem statement and expected use cases

Troubleshooting

Groq error mentioning reasoning_format or Completions.create: update packages

pip install --upgrade groq langchain-groq langchain

Vector index not found: run ingestion

python ingest.py

GROQ_API_KEY is not set: configure .env or export the variable

export GROQ_API_KEY=your_key_here
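
If you want to fail fast with a clear message instead of hitting this error at request time, a startup check along these lines is one option. This is a sketch, not code from app.py; `get_groq_api_key` is a hypothetical helper, and the optional `env` argument exists only to make it easy to test.

```python
import os


def get_groq_api_key(env=None):
    """Return the Groq API key, or raise a readable error if it is missing."""
    env = os.environ if env is None else env
    key = (env.get("GROQ_API_KEY") or "").strip()
    if not key:
        raise RuntimeError(
            "GROQ_API_KEY is not set. Add it to .env or export it in your shell."
        )
    return key
```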

PDF parsing issues: ensure files are valid PDFs; the app uses PyMuPDF to extract text and split into paragraphs with page numbers.
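
To debug parsing problems, it helps to see what "paragraphs with page numbers" looks like as data. The sketch below is not the code in ingest.py: it assumes the per-page text has already been extracted (for example with PyMuPDF's `page.get_text()`), splits on blank lines, and uses a hypothetical `min_chars` threshold to drop fragments such as page numbers.

```python
import re


def split_pages_into_paragraphs(pages, min_chars=40):
    """Split page texts into paragraph records with 1-based page numbers.

    pages: list of strings, one per PDF page, in document order.
    min_chars: hypothetical threshold for discarding very short fragments.
    """
    records = []
    for page_number, page_text in enumerate(pages, start=1):
        # Treat one or more blank lines as a paragraph boundary.
        for block in re.split(r"\n\s*\n", page_text):
            # Collapse line breaks and repeated spaces inside the paragraph.
            paragraph = " ".join(block.split())
            if len(paragraph) >= min_chars:
                records.append({"page": page_number, "text": paragraph})
    return records
```

If a PDF yields no records at all, it is probably a scanned document with no text layer, which text extraction alone cannot read.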

Notes on Citations

The app displays sources as human-readable cards that quote full paragraphs, so excerpts are not cut off mid-chunk.
Citations include title, page (e.g., “(Title, p. 10)”), and a clickable link derived from data/source_links.json.
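
The citation format described above can be sketched as two small helpers. The names `format_citation` and `build_page_link` are hypothetical and the app's own implementation may differ; the example URL is a placeholder.

```python
def format_citation(title, page):
    """Render an inline citation such as '(Title, p. 10)'."""
    return f"({title}, p. {page})"


def build_page_link(base_url, page):
    """Append a #page=N fragment so the link opens the PDF at the cited page."""
    # Drop any existing fragment so the result has exactly one #page=N.
    return f"{base_url.split('#', 1)[0]}#page={page}"
```

For example, `build_page_link("https://example.com/doc.pdf", 10)` yields a URL ending in `doc.pdf#page=10`, which most browser PDF viewers open at page 10.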

Commands Cheat Sheet

# Setup
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env  # set GROQ_API_KEY

# Index and run
python ingest.py
streamlit run app.py

# Update core libs if needed
pip install --upgrade groq langchain-groq langchain