---
title: American University Academic Advisor
emoji: 🎓
colorFrom: blue
colorTo: indigo
sdk: docker
sdk_version: 4.26.0
app_file: app.py
pinned: false
license: cc-by-nc-4.0
short_description: Research Chatbot for AU advising questions
---

# American University Academic Advisor Chatbot

**This is an ongoing student research project under development for academic purposes. The data focuses on mathematics, statistics, and data science programs and is incomplete. Users are cautioned that the LLM may give incomplete or incorrect answers; verify any responses independently against authoritative sources and your advisors before making any decisions.**

A RAG (Retrieval-Augmented Generation) chatbot that uses Mistral 7B and ChromaDB to answer questions about American University academic programs and courses. Information scraped from a variety of public American University websites is "retrieved" from a vector database and used to "augment" each user query before it is passed to the generative AI model.

The database is populated with information from the following American University sources for the 2024-2025 academic year:

- Academic program pages for undergraduate majors and master's degrees in the Mathematics and Statistics department
- Academic program pages for undergraduate minors in the Mathematics and Statistics department
- The course catalog for courses with the identifiers DATA, STAT, MATH, CSC, and ITEC
- Undergraduate and Graduate Academic Regulations
- Study abroad introductory pages and program pages for majors in Data Science, Mathematics, Statistics, and Computer Science
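
The retrieve-then-augment flow can be sketched as follows. This is a self-contained illustration, not the project's actual code: the toy word-overlap score stands in for the real Sentence-Transformers embedding similarity, and all names are hypothetical.

```python
# Toy sketch of the RAG flow: retrieve relevant passages, then augment
# the query with them before sending it to the LLM. The word-overlap
# score is a stand-in for real embedding similarity; names are hypothetical.
DOCUMENTS = [
    "The BS in Data Science requires core courses in statistics and computing.",
    "Study abroad options are available for Mathematics majors.",
    "Graduate academic regulations cover incomplete grades.",
]

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank documents by word overlap with the query and keep the top k."""
    q = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return ranked[:k]

def augment(query: str, context: list[str]) -> str:
    """Prepend the retrieved context so the LLM can ground its answer."""
    return "Context:\n" + "\n".join(context) + "\n\nQuestion: " + query

question = "What does the Data Science BS require?"
prompt = augment(question, retrieve(question, DOCUMENTS))
```

In the real pipeline the retrieval step is a ChromaDB similarity query over embedded chunks rather than a word-overlap ranking.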

## Overview

This chatbot application uses:

- **Shiny**: For the web interface
- **ChromaDB**: As the vector database for storing and retrieving academic information
- **Sentence-Transformers**: For text embeddings
- **Mistral 7B**: For generating human-like responses
- **Playwright**: For web scraping academic program and course information

The application is structured to allow scraping different types of content (academic programs, courses, minors, academic rules, study abroad options) and storing it in a vector database for efficient retrieval.

## Project Structure

```
au_advisor/
├── scrapers/                       # Web scrapers for different data sources
│   ├── course_scraper.py           # Scraper for course catalog
│   ├── program_scraper.py          # Scraper for academic programs
│   ├── minor_scraper.py            # Scraper for minor programs
│   ├── study_abroad_scraper.py     # Scraper for study abroad options
│   ├── academic_rules_scraper.py   # Scraper for academic regulations
│   └── extract_grad_regulations.py # Helper for graduate regulations
├── utils/                          # Utility modules
│   ├── chroma_utils.py             # ChromaDB operations
│   ├── config_utils.py             # Configuration management
│   ├── scraper_utils.py            # Common scraper utilities
│   ├── metadata_utils.py           # Metadata enhancement utilities
│   ├── db_manager.py               # Database management utilities
│   ├── logging_utils.py            # Centralized logging
│   ├── chroma_explorer.py          # Utility for exploring ChromaDB data
│   ├── directory_scan.py           # Scan directory structures
│   └── check_dependencies.py       # Dependency checker
├── utils_deploy/                   # Deployment utilities
│   └── check_dependencies.py       # Dependency checker for deployment
├── config/                         # Configuration files
│   ├── scrapers_config.json        # Scraper configurations
│   ├── models.txt                  # Embedding model mappings
│   ├── keys.txt                    # Authentication keys (template)
│   ├── repo_config.json            # Repository configuration
│   ├── program_urls.txt            # URLs for program scraping
│   ├── course_urls.txt             # URLs for course scraping
│   ├── minors.txt                  # URLs for minor scraping
│   ├── regulations.txt             # URLs for academic rules
│   └── study_abroad_urls.txt       # URLs for study abroad scraping
├── app.py                          # Main Gradio application
├── app_shiny.py                    # Shiny web application (alternative)
├── chatbot.py                      # Core chatbot functionality
├── collect_data.py                 # Data collection script
├── setup.py                        # Package setup configuration
├── requirements.txt                # Dependencies
├── runtime.txt                     # Python version for deployment
└── init.py                         # Environment setup script
```

## Setup Instructions

### 1. Initialize the Environment

The easiest way to set up is to run the included initialization script, which supports both Conda and virtual environments:

```bash
# Run the initialization script
python init.py
```

The script will:

- Create the chosen environment type
- Install dependencies
- Install the Playwright browser
- Create the necessary configuration files

You can also specify your preferences via the command line:

```bash
# For Conda environment
python init.py --env-type conda

# For virtual environment
python init.py --env-type venv

# With custom environment name
python init.py --env-name my-chatbot-env

# With specific Python version
python init.py --python-version 3.12.7
```

### 2. Activate the Environment

After initialization, activate your environment.

For Conda:

```bash
conda activate chatbot_env  # or your custom name
```

For a virtual environment:

```bash
# On Windows
chatbot_env\Scripts\activate

# On macOS/Linux
source chatbot_env/bin/activate
```

### 3. Configuration

1. Create a `.env` file with your Hugging Face API key:
```
HF_API_KEY=your_api_key_here
```

2. (Optional) Configure the embedding model in `config.json`:
```json
{
  "embedding_model": {
    "size": "e5"
  }
}
```
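
At startup the app needs the key from `.env` and the model choice from `config.json`; the sketch below shows one minimal, stdlib-only way to read both. The helper names are illustrative, not the project's `config_utils` API.

```python
# Illustrative sketch of loading settings at startup; the helper names
# are not the project's actual API.
import json
import os
from pathlib import Path

def load_env_file(path: str = ".env") -> None:
    """Populate os.environ from simple KEY=value lines (a minimal dotenv stand-in)."""
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            os.environ[key.strip()] = value.strip()

def load_config(path: str = "config.json") -> dict:
    """Read the embedding-model selection, defaulting to the small model."""
    if Path(path).exists():
        return json.loads(Path(path).read_text())
    return {"embedding_model": {"size": "small"}}
```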

### 4. Data Collection

To gather academic information for the chatbot:

```bash
# Run all scrapers
python collect_data.py

# Run a specific scraper
python collect_data.py --scraper courses
python collect_data.py --scraper programs
python collect_data.py --scraper minors

# Enable debug mode to save detailed outputs
python collect_data.py --debug --save-json

# Use a specific embedding model
python collect_data.py --model e5
```

The scrapers are configured in `config/scrapers_config.json` and use the URL lists from the corresponding text files in the `config` directory.

### 5. Running the Application

Run the Gradio app:

```bash
python app.py
```

Or run the Shiny app (if installed):

```bash
shiny run app_shiny.py
```

## Embedding Models

The system supports multiple embedding models for different use cases:

| Model | Description | Dimensions | Best For |
|-------|-------------|------------|----------|
| small | sentence-transformers/all-MiniLM-L6-v2 | 384 | Fast performance, limited resources |
| medium | sentence-transformers/all-mpnet-base-v2 | 768 | Good balance of quality and performance |
| large | sentence-transformers/all-roberta-large-v1 | 1024 | Better quality, more resources |
| multilingual | paraphrase-multilingual-MiniLM-L12-v2 | 384 | Content in multiple languages |
| e5 | intfloat/e5-large-v2 | 1024 | Highest quality retrieval, requires more resources |

You can select your preferred model in `config.json` or with command-line arguments.
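
Before loading, a size key from the table resolves to a full model name. The dictionary below simply restates the table; the lookup helper and its small-model fallback are illustrative (the project keeps its own mapping in `config/models.txt`).

```python
# Maps the short model keys from the table above to full model names.
# The lookup helper is illustrative, not the project's actual code.
EMBEDDING_MODELS = {
    "small": "sentence-transformers/all-MiniLM-L6-v2",
    "medium": "sentence-transformers/all-mpnet-base-v2",
    "large": "sentence-transformers/all-roberta-large-v1",
    "multilingual": "paraphrase-multilingual-MiniLM-L12-v2",
    "e5": "intfloat/e5-large-v2",
}

def resolve_model(size: str) -> str:
    """Fall back to the small model if the key is unrecognized."""
    return EMBEDDING_MODELS.get(size, EMBEDDING_MODELS["small"])
```

For example, `resolve_model("e5")` yields `intfloat/e5-large-v2`, which Sentence-Transformers then loads for embedding.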

## ChromaDB Explorer Utility

The project includes a utility for exploring and managing the data in ChromaDB:

```bash
# Start interactive mode
python utils/chroma_explorer.py --interactive

# Get statistics about your collection
python utils/chroma_explorer.py stats

# Export documents to JSON
python utils/chroma_explorer.py export --output data/export.json

# Create a human-readable text dump
python utils/chroma_explorer.py dump --output debug/chroma_dump.txt

# Search for specific content
python utils/chroma_explorer.py search "data science program requirements" --results 10

# Delete documents (use with caution!)
python utils/chroma_explorer.py delete --type program
```
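
The commands above suggest a subcommand-style CLI; here is how such a parser might look with `argparse`. This is a sketch, not the actual `chroma_explorer.py` source, and the default for `--results` is assumed.

```python
# Sketch of a subcommand parser matching the explorer commands shown
# above; the real utils/chroma_explorer.py may be organized differently.
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="chroma_explorer")
    parser.add_argument("--interactive", action="store_true")
    sub = parser.add_subparsers(dest="command")
    sub.add_parser("stats")
    for name in ("export", "dump"):      # subcommands that write a file
        sub.add_parser(name).add_argument("--output", required=True)
    search = sub.add_parser("search")
    search.add_argument("query")
    search.add_argument("--results", type=int, default=5)
    sub.add_parser("delete").add_argument("--type")
    return parser

args = build_parser().parse_args(
    ["search", "data science program requirements", "--results", "10"]
)
```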

## Dependency Management

The project includes a dependency checker to ensure all required packages are properly installed:

```bash
# Check dependencies in the current directory
python utils/check_dependencies.py

# Check dependencies in a specific directory
python utils/check_dependencies.py /path/to/project
```

The checker will:

1. Find all imports in Python files
2. Compare them to the packages in requirements.txt
3. Identify missing or extra dependencies
4. Generate a proposed requirements.txt with version pinning
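
Step 1 can be implemented with the standard library's `ast` module; a minimal sketch under that assumption (the checker's real code may differ):

```python
# Minimal sketch of step 1: collecting top-level imported module names
# from source code with the stdlib ast module.
import ast

def find_imports(source: str) -> set[str]:
    """Return the top-level package names imported by the given source."""
    found: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module.split(".")[0])
    return found
```

Comparing this output against the names listed in requirements.txt then reveals missing or unused entries (steps 2 and 3).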

## Debug Mode

All scrapers support a debug mode that provides detailed information during execution:

```bash
# Enable debug mode for program scraper
python scrapers/program_scraper.py --url "https://www.american.edu/cas/mathstat/data-undergrad/" --debug

# Enable debug mode for course scraper
python scrapers/course_scraper.py --debug
```

In debug mode:

- Detailed logs are saved to the `logs` directory
- Screenshots of visited pages are taken
- Raw HTML content is preserved
- Extracted data is saved as JSON
- More verbose console output is provided
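
The split between detailed file logs under `logs/` and debug-dependent console verbosity maps onto the standard `logging` pattern; a sketch under that assumption (`utils/logging_utils.py` may be structured differently):

```python
# Sketch of a debug-aware logger with a file handler (logs/) and a
# console handler; illustrative, not the project's logging_utils code.
import logging
import os

def get_logger(name: str, debug: bool = False, log_dir: str = "logs") -> logging.Logger:
    os.makedirs(log_dir, exist_ok=True)
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    # Full detail always goes to the log file.
    file_handler = logging.FileHandler(os.path.join(log_dir, f"{name}.log"))
    file_handler.setLevel(logging.DEBUG)
    # Console verbosity depends on the --debug flag.
    console = logging.StreamHandler()
    console.setLevel(logging.DEBUG if debug else logging.INFO)
    logger.handlers = [file_handler, console]
    return logger
```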

## Deployment

### Deploying to Hugging Face Spaces

1. Make sure you have a `runtime.txt` file specifying the Python version:
```
python-3.12.7
```

2. Fork this repository to your GitHub account

3. Create a new Hugging Face Space using the Gradio SDK

4. Connect your GitHub repository to the Space

5. Configure any necessary environment variables (HF_API_KEY)

## Troubleshooting

- **API Key Issues**: Ensure your Hugging Face API key is valid and has access to the Mistral 7B model
- **ChromaDB Errors**: Make sure the `chroma_db` directory is writable
- **Scraping Failures**: Check the scraping logs in `logs/` or the detailed debug output in `debug/`
- **Dependency Issues**: Run the dependency checker to identify missing packages
- **Model Compatibility**: If you encounter memory issues, try a smaller embedding model

## License

[CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License.

## Acknowledgments

- American University for the course and program information.
- Hugging Face for providing access to the Mistral 7B model.
- The open source community for the various libraries used in this project.