---
title: SCDM Chatbot App
emoji: 🚀
colorFrom: indigo
colorTo: pink
sdk: streamlit
sdk_version: 1.36.0
app_file: app.py
pinned: false
license: mit
---

## SCDM Chatbot (Streamlit + LangChain + Groq)

A ChatGPT-like assistant for SCDM content. It answers questions, writes summaries, and generates quizzes from the PDFs in `data/pdf/`, and shows human-readable sources with every answer (document title, page, and a clickable link from `data/source_links.json`).

### Features
- Q&A with retrieval-augmented generation (RAG) and readable citations
- Summarization (single or multi-document context)
- Quiz generation (MCQs with answers, explanations, and citations)
- “Auto” intent routing (classifies input to Q&A / Summarize / Quiz)
- Clean source display: full paragraph block quotes, with title + page + link

### Requirements
- Python 3.10–3.12 recommended
- A Groq API key (`GROQ_API_KEY`)
- macOS/Linux/Windows (CPU only; no GPU required)

### Quickstart
1) Create a virtual environment
```bash
python3 -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
```

2) Install dependencies
```bash
pip install --upgrade pip
pip install -r requirements.txt
```

3) Configure environment
```bash
cp .env.example .env
# Edit .env and set: GROQ_API_KEY=your_key_here
```

4) Build the index (extracts paragraphs with page metadata and embeds them)
```bash
python ingest.py
```

5) Run the app
```bash
streamlit run app.py
```
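
Step 4 extracts paragraphs together with their page numbers. The sketch below shows one way that splitting could work, assuming the per-page text has already been extracted (for example with PyMuPDF). The `Paragraph` class, the `split_pages` function, and the `min_chars` threshold are hypothetical and not taken from `ingest.py`.

```python
import re
from dataclasses import dataclass


@dataclass
class Paragraph:
    source: str  # PDF file name
    page: int    # 1-based page number
    text: str


def split_pages(source: str, page_texts: list[str], min_chars: int = 20) -> list[Paragraph]:
    """Split each page's text on blank lines and tag paragraphs with their page."""
    paragraphs = []
    for page_number, page_text in enumerate(page_texts, start=1):
        for block in re.split(r"\n\s*\n", page_text):
            # Collapse internal line breaks and runs of whitespace.
            text = " ".join(block.split())
            # Skip fragments too short to be a real paragraph (assumed threshold).
            if len(text) >= min_chars:
                paragraphs.append(Paragraph(source, page_number, text))
    return paragraphs
```

Keeping the page number on every paragraph is what later allows a citation to point at a specific page.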

### Usage
- Select a model in the sidebar (default: `llama-3.3-70b-versatile`; also available: `llama-3.1-8b-instant`).
- Choose a mode: Auto, Q&A, Summarize, or Quiz. Auto attempts to classify your intent.
- Ask things like:
  - “Tell me about CDM to CDS”
  - “Summarize the key QbD responsibilities for CDS and cite sources.”
  - “Create a 5-question quiz on RBQM with citations.”
- Sources appear below each answer as expanders with:
  - Document title and page number
  - Clickable URL like `...pdf#page=10`
  - Full paragraph block quotes for readability
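
The clickable `...pdf#page=10` links can be built by appending a page fragment to the public URL of the PDF. A minimal sketch, assuming `data/source_links.json` loads into a plain dict of file name to URL; `build_source_url` is a hypothetical helper, not a function from `app.py`.

```python
from typing import Optional


def build_source_url(file_name: str, page: int, source_links: dict) -> Optional[str]:
    """Return the public URL for a PDF with a '#page=N' fragment, or None if unmapped."""
    base_url = source_links.get(file_name)
    if base_url is None:
        return None
    # Drop any existing fragment so the page anchor is not duplicated.
    return f"{base_url.split('#', 1)[0]}#page={page}"
```

Most browser PDF viewers honor the `#page=N` fragment, so the link opens at the cited page.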

### Adding/Updating Documents
1) Place PDFs in `data/pdf/`.
2) Add or update the entry for each PDF in `data/source_links.json`, which maps the PDF file name to its public link.
3) Rebuild the index:
```bash
python ingest.py
```
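
Before rebuilding, it can help to check that every PDF has a link entry. The sketch below assumes `data/source_links.json` is a flat JSON object keyed by PDF file name; that layout and the helper name `find_unmapped_pdfs` are assumptions, so adjust them to the actual file.

```python
import json
from pathlib import Path


def find_unmapped_pdfs(pdf_dir: Path, links_file: Path) -> list[str]:
    """Return names of PDFs in pdf_dir that have no entry in links_file."""
    # Assumed layout: {"file_name.pdf": "https://...", ...}
    links = json.loads(links_file.read_text(encoding="utf-8"))
    pdf_names = sorted(p.name for p in pdf_dir.glob("*.pdf"))
    return [name for name in pdf_names if name not in links]
```

Calling `find_unmapped_pdfs(Path("data/pdf"), Path("data/source_links.json"))` from the project root would list any PDFs whose citations would otherwise lack a link.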

### Project Structure
```
scdm_chatbot/
  app.py                  # Streamlit UI and chains (Q&A, Summarize, Quiz)
  ingest.py               # PDF → paragraph extraction → FAISS index
  requirements.txt        # Python dependencies
  .env.example            # Env var template (GROQ_API_KEY)
  data/
    pdf/                  # Input PDFs
    source_links.json     # File name → source URL mapping
    index/                # Generated FAISS index and manifest
  user_requirements.txt   # Problem statement and expected use cases
```

### Troubleshooting
- Groq error mentioning `reasoning_format` or `Completions.create`: update packages
```bash
pip install --upgrade groq langchain-groq langchain
```

- `Vector index not found`: run ingestion
```bash
python ingest.py
```

- `GROQ_API_KEY is not set`: configure `.env` or export the variable
```bash
export GROQ_API_KEY=your_key_here
```

- PDF parsing issues: make sure each file is a valid PDF. The app uses PyMuPDF to extract text and split it into paragraphs tagged with page numbers.
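
For the `GROQ_API_KEY is not set` case above, a fail-fast check at startup gives a clearer error than a failed API call later. This is a hedged sketch; `require_groq_api_key` is a hypothetical helper and the app's actual check may differ.

```python
import os


def require_groq_api_key(env=None) -> str:
    """Return the Groq API key, or raise with a helpful message if it is missing."""
    # Allow a dict to be passed in for testing; default to the process environment.
    env = os.environ if env is None else env
    key = env.get("GROQ_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GROQ_API_KEY is not set. Add it to .env or export it before starting the app."
        )
    return key
```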

### Notes on Citations
- The app displays sources as human-readable cards that quote full paragraphs, so excerpts are not cut off mid-sentence.
- Citations include title, page (e.g., “(Title, p. 10)”), and a clickable link derived from `data/source_links.json`.
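
As an illustration of the citation format described above, the following sketch renders an inline citation and a Markdown source card with a block-quoted paragraph. Both function names are hypothetical, and the exact card layout in the app may differ.

```python
def format_citation(title: str, page: int) -> str:
    """Inline citation in the '(Title, p. 10)' style."""
    return f"({title}, p. {page})"


def render_source_card(title: str, page: int, url: str, paragraph: str) -> str:
    """Render one source as Markdown: linked heading plus block-quoted paragraph."""
    # Prefix every line so multi-line paragraphs stay inside the block quote.
    quoted = "\n".join(f"> {line}" for line in paragraph.splitlines())
    return f"**[{title}, p. {page}]({url})**\n\n{quoted}"
```

The returned string is plain Markdown, so it could be passed to any Markdown renderer, including Streamlit's.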

### Commands Cheat Sheet
```bash
# Setup
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env  # set GROQ_API_KEY

# Index and run
python ingest.py
streamlit run app.py

# Update core libs if needed
pip install --upgrade groq langchain-groq langchain
```