LectureWhisperer / README.md
Raminnit's picture
Update README.md
79bbd98 verified
---
title: LectureWhisperer
emoji: πŸ†
colorFrom: gray
colorTo: green
sdk: gradio
sdk_version: 6.6.0
app_file: app.py
pinned: false
license: mit
short_description: The Lecture Whisperer is a multimodal AI study assistant.
---
title: Lecture Whisperer
emoji: πŸŽ“
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: "6.0"
app_file: app.py
pinned: false
license: mit
---
# πŸŽ“ The Lecture Whisperer
> **Turn any lecture recording + slides into a full AI-powered study toolkit β€” in minutes.**
[![Hugging Face Spaces](https://img.shields.io/badge/πŸ€—%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces)
[![Python 3.10+](https://img.shields.io/badge/Python-3.10+-3776AB?logo=python&logoColor=white)](https://python.org)
[![Gradio 6.0](https://img.shields.io/badge/Gradio-6.0-orange)](https://gradio.app)
[![License: MIT](https://img.shields.io/badge/License-MIT-green)](LICENSE)
---
## πŸ“Έ Overview
The Lecture Whisperer is a multi-modal AI app that takes a raw lecture audio file and PDF slides as input and produces:
- βœ… A full timestamped transcript
- βœ… Extracted key concepts from every slide
- βœ… A smart sync report mapping *what was said* β†’ *which slide it belongs to*
- βœ… An interactive Q&A chatbot grounded in your lecture content
- βœ… A generated multiple-choice quiz for exam prep
All model inference runs via the **Hugging Face Inference API** β€” no GPU required on the Space itself.
---
## 🧠 Models Used
| Task | Model |
|---|---|
| Audio Transcription | [`openai/whisper-large-v3`](https://huggingface.co/openai/whisper-large-v3) |
| Slide Vision & Text Extraction | [`Qwen/Qwen2-VL-7B-Instruct`](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) |
| Quiz Generation & Q&A Chatbot | [`meta-llama/Meta-Llama-3-8B-Instruct`](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) |
---
## ✨ Features
### Tab 1 β€” Upload & Process
- Upload an **MP3 or WAV** lecture recording
- Upload a **PDF** of lecture slides
- Hit **⚑ Process Lecture** to run the full pipeline
- View a live **Sync Report** showing which transcript segment maps to which slide
### Tab 2 β€” Dashboard
- **πŸ’¬ Chatbot** β€” Ask any question about the lecture. Answers are grounded in the transcript and slide content (RAG-lite)
- **πŸ–ΌοΈ Slide Gallery** β€” Browse all extracted slide images side by side
### Tab 3 β€” Mock Quiz
- Click **🧠 Generate Mock Quiz** to instantly produce 7 multiple-choice questions
- Questions are generated strictly from the lecture transcript β€” no hallucinated content
---
## πŸ”§ How It Works
### 1. Audio Transcription
Whisper Large v3 processes the audio file via the HF Inference API and returns timestamped sentence chunks:
```
[00:04] Welcome to today's lecture on classical mechanics.
[00:20] Newton's Second Law states that force equals mass times acceleration.
```
### 2. Slide Processing
Each PDF page is converted to an image using `pdf2image`. Each image is sent to Qwen2-VL with a prompt to extract all visible text, equations, bullet points, and concepts.
### 3. Sync Logic
A keyword-overlap engine indexes every slide's content into a word set. Each transcript segment is then scored against every slide β€” the highest overlap wins. Example output:
```
[04:20] Newton's Second Law states F = ma
β†’ Slide 5 (score: 4)
```
### 4. Chatbot Q&A
When you ask a question, the app:
1. Finds relevant transcript lines by keyword matching
2. Finds relevant slides by keyword matching
3. Stuffs both into a Llama-3 prompt as context
4. Returns a grounded answer
### 5. Quiz Generation
The full transcript is passed to Llama-3-8B with a strict instruction to generate MCQs *only* from the provided content β€” no external knowledge injected.
---
## πŸš€ Running Locally
### Prerequisites
- Python 3.10+
- `poppler-utils` installed on your system:
```bash
# Ubuntu / Debian
sudo apt install poppler-utils
# macOS
brew install poppler
```
### Setup
```bash
git clone https://huggingface.co/spaces/YOUR_USERNAME/lecture-whisperer
cd lecture-whisperer
pip install -r requirements.txt
```
### Set your HF Token
```bash
export HF_TOKEN=hf_your_token_here
```
### Run
```bash
python app.py
```
Then open [http://localhost:7860](http://localhost:7860) in your browser.
---
## πŸ”‘ Required Secrets (for HF Spaces)
Go to your Space β†’ **Settings β†’ Variables and secrets** β†’ add:
| Secret Name | Value |
|---|---|
| `HF_TOKEN` | Your Hugging Face API token (read access) |
Make sure you have accepted the terms for gated models:
- [Meta Llama 3](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) β€” click "Agree and access repository"
- [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) β€” click "Agree and access repository"
---
## πŸ“ Project Structure
```
lecture-whisperer/
β”œβ”€β”€ app.py # Main Gradio application
β”œβ”€β”€ requirements.txt # Python dependencies
β”œβ”€β”€ packages.txt # System dependencies (poppler-utils)
└── README.md # This file
```
---
## πŸ“¦ Dependencies
```
gradio>=6.0.0
pdf2image>=1.17.0
Pillow>=10.0.0
requests>=2.31.0
```
System dependency (handled by `packages.txt` on HF Spaces):
```
poppler-utils
```
---
## ⚠️ Known Limitations
- **Processing time** β€” Whisper transcription via the free Inference API can take 2–5 minutes for a 1-hour lecture. The app includes automatic retry logic for cold-start delays.
- **Sync accuracy** β€” The current sync engine uses keyword overlap scoring. It works well for technical content but may miss semantic matches (e.g. paraphrased concepts). Future versions will use sentence embeddings.
- **API rate limits** β€” The HF free Inference API has rate limits. For heavy usage, consider upgrading to a PRO token or running models locally.
- **Gated models** β€” Llama-3 and Qwen2-VL require accepting license terms on the HF model page before your token can access them.
---
## πŸ—ΊοΈ Roadmap
- [ ] Sentence-embedding based sync (replace keyword overlap with `all-MiniLM-L6-v2`)
- [ ] One-click lecture summary (5 bullet points)
- [ ] Export quiz as downloadable PDF
- [ ] Speaker diarization (identify multiple speakers)
- [ ] Support for YouTube URLs as audio input
- [ ] Persistent chat history per session
---
## 🀝 Contributing
Pull requests are welcome! For major changes, please open an issue first to discuss what you'd like to change.
---
## πŸ“„ License
This project is licensed under the [MIT License](LICENSE).
---
## πŸ™ Acknowledgements
- [OpenAI Whisper](https://github.com/openai/whisper) for the transcription model
- [Qwen Team at Alibaba](https://huggingface.co/Qwen) for Qwen2-VL
- [Meta AI](https://ai.meta.com) for Llama 3
- [Hugging Face](https://huggingface.co) for the Inference API and Spaces platform
- [Gradio](https://gradio.app) for the UI framework
---
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference