---
title: SCDM Chatbot App
emoji: 🚀
colorFrom: indigo
colorTo: pink
sdk: streamlit
sdk_version: 1.36.0
app_file: app.py
pinned: false
license: mit
---

## SCDM Chatbot (Streamlit + LangChain + Groq)

A ChatGPT-like assistant for SCDM content. It answers questions, summarizes, and generates quizzes over the PDFs in `data/pdf/`, and always shows clear, human-readable sources (document title, page number, and a clickable link from `data/source_links.json`).

### Features
- Q&A with retrieval-augmented generation (RAG) and readable citations
- Summarization (single or multi-document context)
- Quiz generation (MCQs with answers, explanations, and citations)
- “Auto” intent routing (classifies each input as Q&A, Summarize, or Quiz)
- Clean source display: full paragraph block quotes, with title + page + link

### Requirements
- Python 3.10–3.12 recommended
- A Groq API key (`GROQ_API_KEY`)
- macOS/Linux/Windows (CPU only; no GPU required)

### Quickstart
1) Create a virtual environment
```bash
python3 -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
```

2) Install dependencies
```bash
pip install --upgrade pip
pip install -r requirements.txt
```

3) Configure environment
```bash
cp .env.example .env
# Edit .env and set: GROQ_API_KEY=your_key_here
```

4) Build the index (extracts paragraphs with page metadata and embeds them)
```bash
python ingest.py
```

5) Run the app
```bash
streamlit run app.py
```
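
Step 3 matters because the app needs `GROQ_API_KEY` at runtime. The sketch below shows one way to fail early with a readable message when the key is missing. The helper name `require_env` is hypothetical, and how `app.py` actually loads `.env` is not shown here, so treat this as an assumption-laden example.

```python
import os


def require_env(name, environ=None):
    """Return the stripped value of an environment variable or raise a clear error.

    `environ` defaults to os.environ; passing a dict makes the check easy to test.
    """
    env = os.environ if environ is None else environ
    value = (env.get(name) or "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to .env or export it in your shell."
        )
    return value
```

Calling `require_env("GROQ_API_KEY")` once at startup surfaces a missing key before any request is sent.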

### Usage
- Select a model in the sidebar (default: `llama-3.3-70b-versatile`; also available: `llama-3.1-8b-instant`).
- Choose a mode: Auto, Q&A, Summarize, or Quiz. Auto attempts to classify your intent.
- Ask things like:
  - “Tell me about CDM to CDS”
  - “Summarize the key QbD responsibilities for CDS and cite sources.”
  - “Create a 5-question quiz on RBQM with citations.”
- Sources appear below each answer as expanders with:
  - Document title and page number
  - Clickable URL like `...pdf#page=10`
  - Full paragraph block quotes for readability
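
The page-specific URLs above can be sketched as a small helper that looks up a file name in the link mapping and appends a `#page=N` fragment. The function name `build_page_link` and the flat `{file name: URL}` shape of the mapping are assumptions made for illustration.

```python
def build_page_link(source_links, file_name, page):
    """Return '<url>#page=<page>' for a known file, or None if the file has no link.

    `source_links` is assumed to be a dict of PDF file name -> public URL.
    """
    base_url = source_links.get(file_name)
    if base_url is None:
        return None
    # Drop any existing fragment so the page anchor is not duplicated.
    base_url = base_url.split("#", 1)[0]
    return f"{base_url}#page={page}"
```

For example, with `{"guide.pdf": "https://example.com/docs/guide.pdf"}` (a made-up URL), page 10 resolves to `https://example.com/docs/guide.pdf#page=10`.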

### Adding/Updating Documents
1) Place PDFs in `data/pdf/`.
2) Add or update the entry for each PDF in `data/source_links.json`, which maps the PDF file name to its public link.
3) Rebuild the index:
```bash
python ingest.py
```
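
Before rebuilding, it can help to confirm that every PDF has a link entry. The following is a minimal sketch, assuming `data/source_links.json` is a flat JSON object of file name to URL; the helper name `find_unlinked_pdfs` is hypothetical and not part of the project.

```python
import json
from pathlib import Path


def find_unlinked_pdfs(pdf_dir, links_path):
    """Return the sorted PDF file names in pdf_dir that have no entry in links_path."""
    links = json.loads(Path(links_path).read_text(encoding="utf-8"))
    pdf_names = sorted(p.name for p in Path(pdf_dir).glob("*.pdf"))
    return [name for name in pdf_names if name not in links]
```

Running `find_unlinked_pdfs("data/pdf", "data/source_links.json")` from the project root returns an empty list when every document is mapped.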

### Project Structure
```
scdm_chatbot/
  app.py                  # Streamlit UI and chains (Q&A, Summarize, Quiz)
  ingest.py               # PDF → paragraph extraction → FAISS index
  requirements.txt        # Python dependencies
  .env.example            # Env var template (GROQ_API_KEY)
  data/
    pdf/                  # Input PDFs
    source_links.json     # File name → source URL mapping
    index/                # Generated FAISS index and manifest
  user_requirements.txt   # Problem statement and expected use cases
```

### Troubleshooting
- Groq error mentioning `reasoning_format` or `Completions.create`: update packages
```bash
pip install --upgrade groq langchain-groq langchain
```

- `Vector index not found`: run ingestion
```bash
python ingest.py
```

- `GROQ_API_KEY is not set`: configure `.env` or export the variable
```bash
export GROQ_API_KEY=your_key_here
```

- PDF parsing issues: ensure the files are valid PDFs. The app uses PyMuPDF to extract text and splits it into paragraphs tagged with page numbers.
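
When debugging parsing problems, it can be useful to reproduce the paragraph-splitting step in isolation. The sketch below splits already-extracted page text on blank lines and tags each paragraph with its 1-based page number. The names `Paragraph` and `split_pages_into_paragraphs` and the blank-line rule are assumptions; `ingest.py` may split differently.

```python
import re
from dataclasses import dataclass


@dataclass
class Paragraph:
    text: str
    page: int  # 1-based page number


def split_pages_into_paragraphs(page_texts, min_chars=1):
    """Split each page's text on blank lines and keep the page number with each paragraph."""
    paragraphs = []
    for page_number, page_text in enumerate(page_texts, start=1):
        for block in re.split(r"\n\s*\n", page_text):
            # Collapse line breaks and repeated spaces inside a paragraph.
            cleaned = " ".join(block.split())
            if len(cleaned) >= min_chars:
                paragraphs.append(Paragraph(text=cleaned, page=page_number))
    return paragraphs
```

With PyMuPDF installed, `page_texts` could be gathered with something like `[page.get_text() for page in fitz.open(path)]`. If that list is mostly empty strings, the PDF is likely scanned images rather than extractable text.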

### Notes on Citations
- The app displays sources as human-readable cards with full paragraphs to avoid broken chunks.
- Citations include title, page (e.g., “(Title, p. 10)”), and a clickable link derived from `data/source_links.json`.
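
As an illustration of the citation format described above, the following sketch renders a citation label and, when a URL is available, wraps it in a Markdown link. The function name `format_citation` is hypothetical and the app's own rendering may differ.

```python
def format_citation(title, page, url=None):
    """Return '(Title, p. N)', wrapped in a Markdown link when a URL is given."""
    label = f"({title}, p. {page})"
    return f"[{label}]({url})" if url else label
```

Keeping the label and the link separate means a citation still reads correctly when a document has no entry in `data/source_links.json`.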

### Commands Cheat Sheet
```bash
# Setup
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env  # set GROQ_API_KEY

# Index and run
python ingest.py
streamlit run app.py

# Update core libs if needed
pip install --upgrade groq langchain-groq langchain
```