	---
	title: SCDM Chatbot App
	emoji: 🚀
	colorFrom: indigo
	colorTo: pink
	sdk: streamlit
	sdk_version: 1.36.0
	app_file: app.py
	pinned: false
	license: mit
	---

	## SCDM Chatbot (Streamlit + LangChain + Groq)

	A ChatGPT-like assistant for SCDM content. It answers questions, summarizes, and generates quizzes over the PDFs in `data/pdf/`, and shows human-readable sources with every answer: document title, page number, and a clickable link from `data/source_links.json`.

	### Features
	- Q&A with retrieval-augmented generation (RAG) and readable citations
	- Summarization (single or multi-document context)
	- Quiz generation (MCQs with answers, explanations, and citations)
	- “Auto” intent routing (classifies input to Q&A / Summarize / Quiz)
	- Clean source display: full paragraph block quotes, with title + page + link
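The “Auto” routing feature can be pictured with a small sketch. This README does not say how the app classifies intent, so the keyword matching below is only a stand-in for illustration; `route_intent` and the patterns are hypothetical names, not taken from `app.py`.

```python
import re

# Hypothetical keyword patterns. The app's real classifier is not documented
# in this README; this is a simple heuristic stand-in.
_PATTERNS = {
    "quiz": re.compile(r"\b(quiz|mcqs?|multiple[- ]choice)\b", re.IGNORECASE),
    "summarize": re.compile(r"\b(summari[sz]e|summary|overview)\b", re.IGNORECASE),
}


def route_intent(text: str) -> str:
    """Return 'quiz', 'summarize', or 'qa' (the fallback) for a user message."""
    for mode, pattern in _PATTERNS.items():
        if pattern.search(text):
            return mode
    return "qa"


if __name__ == "__main__":
    for prompt in (
        "Tell me about CDM to CDS",
        "Summarize the key QbD responsibilities for CDS and cite sources.",
        "Create a 5-question quiz on RBQM with citations.",
    ):
        print(f"{route_intent(prompt):10} <- {prompt}")
```

Falling back to Q&A keeps ambiguous inputs on the most general path; choosing a mode explicitly in the sidebar bypasses routing altogether.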

	### Requirements
	- Python 3.10–3.12 recommended
	- A Groq API key (`GROQ_API_KEY`)
	- macOS/Linux/Windows (CPU only; no GPU required)

	### Quickstart
	1) Create a virtual environment
	```bash
	python3 -m venv .venv
	source .venv/bin/activate # Windows: .venv\Scripts\activate
	```

	2) Install dependencies
	```bash
	pip install --upgrade pip
	pip install -r requirements.txt
	```

	3) Configure environment
	```bash
	cp .env.example .env
	# Edit .env and set: GROQ_API_KEY=your_key_here
	```

	4) Build the index (extracts paragraphs with page metadata and embeds them)
	```bash
	python ingest.py
	```

	5) Run the app
	```bash
	streamlit run app.py
	```
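
Step 4 extracts paragraphs together with their page numbers. A minimal sketch of that idea follows, assuming the page texts have already been extracted (in this project, by PyMuPDF). The `Paragraph` class, `split_pages_into_paragraphs`, and the `min_chars` threshold are illustrative assumptions, not the actual contents of `ingest.py`.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Paragraph:
    source: str  # PDF file name
    page: int    # 1-based page number
    text: str    # paragraph text with whitespace normalized


def split_pages_into_paragraphs(source, pages, min_chars=40):
    """Split each page's text on blank lines, keeping the page number.

    `pages` is a list of strings, one per PDF page. Blocks shorter than
    `min_chars` (headers, stray page numbers) are dropped; that threshold
    is an assumption for this sketch.
    """
    paragraphs = []
    for page_number, page_text in enumerate(pages, start=1):
        for block in re.split(r"\n\s*\n", page_text):
            text = " ".join(block.split())  # collapse line breaks and runs of spaces
            if len(text) >= min_chars:
                paragraphs.append(Paragraph(source, page_number, text))
    return paragraphs
```

Keeping the page number on every paragraph is what lets the app later cite a page and build a `#page=` link.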

	### Usage
	- Select a model in the sidebar (default: `llama-3.3-70b-versatile`; also available: `llama-3.1-8b-instant`).
	- Choose a mode: Auto, Q&A, Summarize, or Quiz. Auto attempts to classify your intent.
	- Ask things like:
	  - “Tell me about CDM to CDS”
	  - “Summarize the key QbD responsibilities for CDS and cite sources.”
	  - “Create a 5-question quiz on RBQM with citations.”
	- Sources appear below each answer as expanders with:
	  - Document title and page number
	  - A clickable URL like `...pdf#page=10`
	  - Full paragraph block quotes for readability

	### Adding/Updating Documents
	1) Place PDFs in `data/pdf/`.
	2) Add/update entries in `data/source_links.json` with the PDF file name → public link mapping.
	3) Rebuild the index:
	```bash
	python ingest.py
	```
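
Before rebuilding, it can help to confirm that every PDF has a link entry. The sketch below assumes `data/source_links.json` is a flat JSON object of the form `{"file name.pdf": "https://..."}`; this README does not spell out the exact schema, and `find_unlinked_pdfs` is a hypothetical helper.

```python
import json
from pathlib import Path


def find_unlinked_pdfs(pdf_dir, links_file):
    """Return the sorted names of PDFs in `pdf_dir` with no entry in `links_file`.

    Assumes `links_file` holds a flat JSON object keyed by PDF file name.
    """
    links = json.loads(Path(links_file).read_text(encoding="utf-8"))
    pdf_names = sorted(p.name for p in Path(pdf_dir).glob("*.pdf"))
    return [name for name in pdf_names if name not in links]


if __name__ == "__main__":
    missing = find_unlinked_pdfs("data/pdf", "data/source_links.json")
    for name in missing:
        print(f"No source link for {name}")
```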

	### Project Structure
	```
	scdm_chatbot/
	  app.py                   # Streamlit UI and chains (Q&A, Summarize, Quiz)
	  ingest.py                # PDF → paragraph extraction → FAISS index
	  requirements.txt         # Python dependencies
	  .env.example             # Env var template (GROQ_API_KEY)
	  data/
	    pdf/                   # Input PDFs
	    source_links.json      # File name → source URL mapping
	    index/                 # Generated FAISS index and manifest
	  user_requirements.txt    # Problem statement and expected use cases
	```

	### Troubleshooting
	- Groq error mentioning `reasoning_format` or `Completions.create`: update packages
	```bash
	pip install --upgrade groq langchain-groq langchain
	```

	- `Vector index not found`: run ingestion
	```bash
	python ingest.py
	```

	- `GROQ_API_KEY is not set`: configure `.env` or export the variable
	```bash
	export GROQ_API_KEY=your_key_here
	```

	- PDF parsing issues: make sure the files are valid PDFs. The app uses PyMuPDF to extract text and split it into paragraphs with page numbers.
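
The first two setup checks above (API key and index) can be sketched as a small preflight function. This is an illustration only: `preflight` is a hypothetical helper, the index directory is passed in because its exact location is an assumption here, and the messages simply mirror the error strings listed above.

```python
import os
from pathlib import Path


def preflight(index_dir, env=None):
    """Return a list of setup problems; an empty list means ready to run.

    `index_dir` is wherever ingest.py wrote the index (an assumption of this
    sketch); `env` defaults to the process environment.
    """
    env = os.environ if env is None else env
    problems = []
    if not env.get("GROQ_API_KEY"):
        problems.append("GROQ_API_KEY is not set")
    index_path = Path(index_dir)
    # Treat a missing or empty directory as "no index built yet".
    if not index_path.is_dir() or not any(index_path.iterdir()):
        problems.append("Vector index not found")
    return problems
```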

	### Notes on Citations
	- The app displays sources as human-readable cards with full paragraphs to avoid broken chunks.
	- Citations include title, page (e.g., “(Title, p. 10)”), and a clickable link derived from `data/source_links.json`.
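
As a rough illustration of the citation format described above, the sketch below builds the “(Title, p. 10)” label and, when a link is known, a Markdown link ending in `#page=N`. `format_citation` is a hypothetical helper and the example URL is a placeholder; the app's own rendering code may differ.

```python
def format_citation(title, page, base_url=None):
    """Format a citation as '(Title, p. N)', linked to the page if a URL is given.

    `base_url` would come from data/source_links.json; any existing fragment
    is replaced with '#page=N' so the link opens at the cited page.
    """
    label = f"({title}, p. {page})"
    if not base_url:
        return label
    url = f"{base_url.split('#', 1)[0]}#page={page}"
    return f"[{label}]({url})"


if __name__ == "__main__":
    # Placeholder title and URL, for illustration only.
    print(format_citation("Example Title", 10, "https://example.org/doc.pdf"))
```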

	### Commands Cheat Sheet
	```bash
	# Setup
	python3 -m venv .venv && source .venv/bin/activate
	pip install -r requirements.txt
	cp .env.example .env # set GROQ_API_KEY

	# Index and run
	python ingest.py
	streamlit run app.py

	# Update core libs if needed
	pip install --upgrade groq langchain-groq langchain
	```