chaaim123 committed (verified) · Commit ed84fb4 · Parent: 925c54d

Update README.md

Files changed (1): README.md (+286 −5)
README.md CHANGED

```diff
@@ -1,12 +1,293 @@
 ---
-title: Demo10
-emoji: 👁
-colorFrom: purple
 colorTo: indigo
 sdk: docker
-sdk_version: 5.26.0
 app_file: app.py
 pinned: false
 ---

-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
```
---
title: American University Academic Advisor
emoji: 🎓
colorFrom: blue
colorTo: indigo
sdk: docker
sdk_version: 4.26.0
app_file: app.py
pinned: false
license: cc-by-nc-4.0
short_description: Research Chatbot for AU advising questions
---

# American University Academic Advisor Chatbot

**This is an ongoing student research project under development for academic purposes. The data focuses on mathematics, statistics, and data science programs and is incomplete. Users are cautioned that the LLM may give incomplete or incorrect answers; verify any responses independently against authoritative sources and your advisors before making any decisions.**

A RAG (Retrieval-Augmented Generation) chatbot that uses Mistral 7B and ChromaDB to answer questions about American University academic programs and courses. The chatbot draws on information from a variety of American University public websites to provide context for user queries before they are passed to the generative AI model: relevant information is "retrieved" from a database of scraped content and used to "augment" the query sent to the model.

The database is populated with information from the following American University sources for the 2024-2025 academic year:

- Academic program pages for undergraduate majors and master's degrees in the Mathematics and Statistics department.
- Academic program pages for undergraduate minors in the Mathematics and Statistics department.
- The course catalog for courses with the identifiers DATA, STAT, MATH, CSC, and ITEC.
- Undergraduate and graduate academic regulations.
- Study abroad introductory pages and program pages for majors in Data Science, Mathematics, Statistics, and Computer Science.

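The retrieve-then-augment flow described above can be sketched in a few lines. This is a minimal, dependency-free illustration that scores documents by word overlap; the actual app uses ChromaDB with sentence-transformer embeddings, and the document snippets below are invented placeholders, not real AU catalog text.

```python
# Minimal sketch of the retrieve-then-augment (RAG) flow.
# Placeholder snippets stand in for the scraped AU pages; the real app
# retrieves by embedding similarity from ChromaDB, not word overlap.

def retrieve(query, documents, k=1):
    """Rank documents by word overlap with the query (toy retriever)."""
    q_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def augment(query, context):
    """Prepend retrieved context to the user query before it goes to the LLM."""
    joined = "\n".join(context)
    return f"Context:\n{joined}\n\nQuestion: {query}"

docs = [
    "The Data Science BS requires STAT-202 and CSC-280.",   # invented example text
    "Study abroad options exist for Mathematics majors.",
]
question = "What courses does the data science major require?"
context = retrieve(question, docs)
prompt = augment(question, context)
```

The augmented `prompt` is what actually gets sent to the generative model, so the answer is grounded in the retrieved snippet rather than the model's training data alone.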
## Overview

This chatbot application uses:

- **Gradio**: for the main web interface (a Shiny app is included as an alternative)
- **ChromaDB**: as the vector database for storing and retrieving academic information
- **Sentence-Transformers**: for text embeddings
- **Mistral 7B**: for generating human-like responses
- **Playwright**: for scraping academic program and course information

The application is structured to allow scraping different types of content (academic programs, courses, minors, academic rules, study abroad options) and storing it in a vector database for efficient retrieval.

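One way to see why the content types matter: each scraped page can be stored as a record whose metadata tags its type, so retrieval and maintenance can filter by type. A hypothetical sketch (the field names are assumptions for illustration, not the project's actual schema in `chroma_utils.py`):

```python
# Hypothetical record layout for scraped content; the real schema used by
# chroma_utils.py / metadata_utils.py may differ.
records = [
    {"id": "prog-1", "type": "program", "text": "BS in Data Science overview"},
    {"id": "course-1", "type": "course", "text": "STAT-202 Applied Statistics"},
    {"id": "minor-1", "type": "minor", "text": "Mathematics minor requirements"},
]

def by_type(records, doc_type):
    """Filter records by content type, e.g. for typed search or deletion."""
    return [r for r in records if r["type"] == doc_type]

programs = by_type(records, "program")
```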
## Project Structure

```
au_advisor/
├── scrapers/                        # Web scrapers for different data sources
│   ├── course_scraper.py            # Scraper for course catalog
│   ├── program_scraper.py           # Scraper for academic programs
│   ├── minor_scraper.py             # Scraper for minor programs
│   ├── study_abroad_scraper.py      # Scraper for study abroad options
│   ├── academic_rules_scraper.py    # Scraper for academic regulations
│   └── extract_grad_regulations.py  # Helper for graduate regulations
├── utils/                           # Utility modules
│   ├── chroma_utils.py              # ChromaDB operations
│   ├── config_utils.py              # Configuration management
│   ├── scraper_utils.py             # Common scraper utilities
│   ├── metadata_utils.py            # Metadata enhancement utilities
│   ├── db_manager.py                # Database management utilities
│   ├── logging_utils.py             # Centralized logging
│   ├── chroma_explorer.py           # Utility for exploring ChromaDB data
│   ├── directory_scan.py            # Scan directory structures
│   └── check_dependencies.py        # Dependency checker
├── utils_deploy/                    # Deployment utilities
│   └── check_dependencies.py        # Dependency checker for deployment
├── config/                          # Configuration files
│   ├── scrapers_config.json         # Scraper configurations
│   ├── models.txt                   # Embedding model mappings
│   ├── keys.txt                     # Authentication keys (template)
│   ├── repo_config.json             # Repository configuration
│   ├── program_urls.txt             # URLs for program scraping
│   ├── course_urls.txt              # URLs for course scraping
│   ├── minors.txt                   # URLs for minor scraping
│   ├── regulations.txt              # URLs for academic rules
│   └── study_abroad_urls.txt        # URLs for study abroad scraping
├── app.py                           # Main Gradio application
├── app_shiny.py                     # Shiny web application (alternative)
├── chatbot.py                       # Core chatbot functionality
├── collect_data.py                  # Data collection script
├── setup.py                         # Package setup configuration
├── requirements.txt                 # Dependencies
├── runtime.txt                      # Python version for deployment
└── init.py                          # Environment setup script
```

## Setup Instructions

### 1. Initialize the Environment

The easiest way to set up is with the included initialization script, which supports both Conda and virtual environments:

```bash
# Run the initialization script
python init.py
```

The script will:

- Create the chosen environment type
- Install dependencies
- Install the Playwright browser
- Create the necessary configuration files

You can also specify your preferences on the command line:

```bash
# For a Conda environment
python init.py --env-type conda

# For a virtual environment
python init.py --env-type venv

# With a custom environment name
python init.py --env-name my-chatbot-env

# With a specific Python version
python init.py --python-version 3.12.7
```

### 2. Activate the Environment

After initialization, activate your environment.

For Conda:

```bash
conda activate chatbot_env  # or your custom name
```

For a virtual environment:

```bash
# On Windows
chatbot_env\Scripts\activate

# On macOS/Linux
source chatbot_env/bin/activate
```

### 3. Configuration

1. Create a `.env` file with your Hugging Face API key:

   ```
   HF_API_KEY=your_api_key_here
   ```

2. (Optional) Configure the embedding model in `config.json`:

   ```json
   {
     "embedding_model": {
       "size": "e5"
     }
   }
   ```

### 4. Data Collection

To gather academic information for the chatbot:

```bash
# Run all scrapers
python collect_data.py

# Run a specific scraper
python collect_data.py --scraper courses
python collect_data.py --scraper programs
python collect_data.py --scraper minors

# Enable debug mode to save detailed outputs
python collect_data.py --debug --save-json

# Use a specific embedding model
python collect_data.py --model e5
```

The scrapers are configured in `config/scrapers_config.json` and use URL lists from the corresponding text files in the `config` directory.
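The `--scraper` flag above implies a simple dispatch from scraper name to scraper function. A hedged sketch of how such a CLI could be wired up with `argparse` (the scraper callables here are placeholders, not `collect_data.py`'s real internals):

```python
import argparse

# Placeholder scraper callables; the real script imports these from scrapers/.
SCRAPERS = {
    "courses": lambda: "ran course scraper",
    "programs": lambda: "ran program scraper",
    "minors": lambda: "ran minor scraper",
}

def main(argv=None):
    parser = argparse.ArgumentParser(description="Collect AU academic data")
    parser.add_argument("--scraper", choices=sorted(SCRAPERS), default=None,
                        help="run a single scraper instead of all of them")
    parser.add_argument("--debug", action="store_true",
                        help="save detailed outputs while scraping")
    args = parser.parse_args(argv)
    # With no --scraper flag, run every registered scraper.
    targets = [args.scraper] if args.scraper else sorted(SCRAPERS)
    return [SCRAPERS[name]() for name in targets]

results = main(["--scraper", "courses"])
```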

### 5. Running the Application

Run the Gradio app:

```bash
python app.py
```

Or run the Shiny app (if installed):

```bash
shiny run app_shiny.py
```

## Embedding Models

The system supports multiple embedding models for different use cases:

| Model | Description | Dimensions | Best For |
|-------|-------------|------------|----------|
| small | sentence-transformers/all-MiniLM-L6-v2 | 384 | Fast performance, limited resources |
| medium | sentence-transformers/all-mpnet-base-v2 | 768 | Good balance of quality and performance |
| large | sentence-transformers/all-roberta-large-v1 | 1024 | Better quality, more resources |
| multilingual | paraphrase-multilingual-MiniLM-L12-v2 | 384 | Content in multiple languages |
| e5 | intfloat/e5-large-v2 | 1024 | Highest-quality retrieval, requires more resources |

You can select your preferred model in `config.json` or with command-line arguments.
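Selecting a model by its short key, as in the `config.json` example earlier, amounts to a lookup against a mapping like the table above. A minimal sketch (the mapping is copied from the table; the project actually reads it from `config/models.txt`, whose exact format isn't shown here):

```python
import json

# Key-to-model mapping taken from the table above; the project stores
# this in config/models.txt rather than hard-coding it.
EMBEDDING_MODELS = {
    "small": "sentence-transformers/all-MiniLM-L6-v2",
    "medium": "sentence-transformers/all-mpnet-base-v2",
    "large": "sentence-transformers/all-roberta-large-v1",
    "multilingual": "paraphrase-multilingual-MiniLM-L12-v2",
    "e5": "intfloat/e5-large-v2",
}

def resolve_model(config_text, default="small"):
    """Resolve the embedding model name from a config.json string."""
    config = json.loads(config_text)
    size = config.get("embedding_model", {}).get("size", default)
    return EMBEDDING_MODELS[size]

model_name = resolve_model('{"embedding_model": {"size": "e5"}}')
```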

## ChromaDB Explorer Utility

The project includes a utility for exploring and managing the data in ChromaDB:

```bash
# Start interactive mode
python utils/chroma_explorer.py --interactive

# Get statistics about your collection
python utils/chroma_explorer.py stats

# Export documents to JSON
python utils/chroma_explorer.py export --output data/export.json

# Create a human-readable text dump
python utils/chroma_explorer.py dump --output debug/chroma_dump.txt

# Search for specific content
python utils/chroma_explorer.py search "data science program requirements" --results 10

# Delete documents (use with caution!)
python utils/chroma_explorer.py delete --type program
```

## Dependency Management

The project includes a dependency checker to ensure all required packages are properly installed:

```bash
# Check dependencies in the current directory
python utils/check_dependencies.py

# Check dependencies in a specific directory
python utils/check_dependencies.py /path/to/project
```

The checker will:

1. Find all imports in Python files
2. Compare them to the packages in `requirements.txt`
3. Identify missing or extra dependencies
4. Generate a proposed `requirements.txt` with version pinning

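The first two checker steps can be sketched with the standard library's `ast` module: parse a file for its imports, then compare them against the names pinned in `requirements.txt`. A simplified illustration, not the checker's actual code (in practice, import names and PyPI package names don't always match, and stdlib modules must be excluded; the real checker would have to handle both):

```python
import ast

def imported_modules(source):
    """Collect top-level module names imported by a Python source string."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.add(node.module.split(".")[0])
    return names

def missing_requirements(source, requirements):
    """Report imports with no matching entry in a requirements.txt listing."""
    pinned = {line.split("==")[0].strip().lower()
              for line in requirements if line.strip()}
    return sorted(m for m in imported_modules(source) if m.lower() not in pinned)

code = "import chromadb\nfrom sentence_transformers import SentenceTransformer\n"
gaps = missing_requirements(code, ["chromadb==0.4.0"])
```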
## Debug Mode

All scrapers support a debug mode that provides detailed information during execution:

```bash
# Enable debug mode for the program scraper
python scrapers/program_scraper.py --url "https://www.american.edu/cas/mathstat/data-undergrad/" --debug

# Enable debug mode for the course scraper
python scrapers/course_scraper.py --debug
```

In debug mode:

- Detailed logs are saved to the `logs` directory
- Screenshots of visited pages are taken
- Raw HTML content is preserved
- Extracted data is saved as JSON
- More verbose console output is provided

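The behavior above implies a logger whose verbosity depends on the debug flag. A hedged sketch of what a helper like `logging_utils.py` might provide (the function name and signature are assumptions for illustration):

```python
import logging

def setup_logger(name, debug=False):
    """Return a logger that is more verbose when debug mode is enabled.

    The real project also attaches a file handler writing into logs/;
    this sketch only configures the level and a console handler.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG if debug else logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger

log = setup_logger("program_scraper", debug=True)
```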
## Deployment

### Deploying to Hugging Face Spaces

1. Make sure you have a `runtime.txt` file specifying the Python version:

   ```
   python-3.12.7
   ```

2. Fork this repository to your GitHub account
3. Create a new Hugging Face Space using the Gradio SDK
4. Connect your GitHub repository to the Space
5. Configure any necessary environment variables (`HF_API_KEY`)

## Troubleshooting

- **API Key Issues**: Ensure your Hugging Face API key is valid and has access to the Mistral 7B model
- **ChromaDB Errors**: Make sure the `chroma_db` directory is writable
- **Scraping Failures**: Check the scraping logs in `logs/` or the detailed debug output in `debug/`
- **Dependency Issues**: Run the dependency checker to identify missing packages
- **Model Compatibility**: If you encounter memory issues, try a smaller embedding model

## License

[![License: CC BY-NC 4.0](https://img.shields.io/badge/License-CC%20BY--NC%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc/4.0/)

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License.

## Acknowledgments

- American University for the course and program information.
- Hugging Face for providing access to the Mistral 7B model.
- The open-source community for the various libraries used in this project.