rosvend committed on
Commit a4e9f02 · 1 Parent(s): d767722

Add README.md for HuggingFace Space

Files changed (1)
  1. README.md +128 -353
README.md CHANGED
@@ -1,386 +1,161 @@
1
- # UPB RAG Career Exploration Assistant
2
-
3
- A Retrieval-Augmented Generation (RAG) system to help prospective students explore UPB's engineering programs conversationally in Spanish. The system uses manually curated markdown documents and provides multi-strategy retrieval for accurate, context-aware responses.
4
-
5
- ## Features
6
-
7
- - Conversational AI with GPT-4o-mini for natural interactions
8
- - Multi-strategy retrieval: BM25, Vector Similarity, MMR, and Hybrid RRF
9
- - Conversation memory for multi-turn dialogues
10
- - Source citations for transparency
11
- - Spanish language optimized
12
- - 16 curated documents covering 12 engineering programs
13
- - 217 optimized chunks for efficient retrieval
14
-
15
- ## Tech Stack
16
-
17
- | Component | Technology |
18
- |-----------|-----------|
19
- | **LLM** | Azure ChatOpenAI / OpenAI GPT-4o-mini |
20
- | **Embeddings** | AzureOpenAIEmbeddings (text-embedding-3-small) |
21
- | **Vector Store** | FAISS (CPU version) |
22
- | **Framework** | LangChain 1.0.2 |
23
- | **Retrieval** | BM25 (rank-bm25) + MMR + RRF Ensemble |
24
- | **UI** | Gradio *(planned)* |
25
- | **Deployment** | Hugging Face Spaces *(planned)* |
26
- | **Package Manager** | UV |
27
-
28
- ## Project Structure
29
-
30
- ```
31
- .
32
- ├── data/ # Curated markdown content
33
- │   ├── about_upb.md # University information
34
- │   ├── contact/ # Contact information
35
- │   ├── engineerings/ # Engineering program details (12 programs)
36
- │   ├── enroll/ # Enrollment information
37
- │   └── scholarships/ # Financial aid & scholarships
38
- │
39
- ├── src/
40
- │   ├── embeddings/
41
- │   │   └── embeddings.py # Azure/OpenAI embeddings initialization
42
- │   ├── loader/
43
- │   │   └── ingest.py # Document loader with metadata enrichment
44
- │   ├── processing/
45
- │   │   └── chunking.py # Smart text chunking module
46
- │   ├── rag/
47
- │   │   └── chain.py # RAG chain with conversation memory
48
- │   ├── retrieval/
49
- │   │   └── retriever.py # Multi-strategy retriever (BM25/MMR/Hybrid)
50
- │   ├── vectorstore/
51
- │   │   └── store.py # FAISS vector store manager
52
- │   ├── pipeline.py # Document preparation pipeline
53
- │   └── setup_retrieval.py # Complete retrieval system setup
54
- │
55
- ├── vectorstore/ # FAISS index files (gitignored)
56
- └── pyproject.toml # Dependencies (UV)
57
- ```
58
-
59
- ## 🚀 Getting Started
60
-
61
- ### Prerequisites
62
-
63
- - Python 3.12
64
- - [UV](https://docs.astral.sh/uv/) package manager
65
-
66
-
67
- ### Installation
68
 
69
  ```bash
70
- # Clone the repository
71
- git clone https://github.com/Rosvend/UPB-RAG-Careers.git
72
- cd UPB-RAG-Careers
73
-
74
- # Install dependencies with UV
75
- uv sync
76
- ```
77
-
78
- ### Configuration
79
-
80
- Create a `.env` file with your Azure OpenAI credentials (for embeddings & LLM):
81
-
82
- ```env
83
- AZURE_OPENAI_API_KEY=your_key_here
84
  AZURE_OPENAI_ENDPOINT=https://your-endpoint.openai.azure.com/
85
  AZURE_OPENAI_EMBEDDING_DEPLOYMENT=text-embedding-3-small
86
  AZURE_OPENAI_LLM_DEPLOYMENT=gpt-4o-mini
87
  ```
88
 
89
- ### Running the Pipeline
90
-
91
- #### 1. **RAG Chain with Conversation** (Full System)
92
- ```bash
93
- # Test complete RAG chain with GPT-4o-mini, memory, and source citations
94
- uv run python src/rag/chain.py
95
- ```
96
- **What it does**:
97
- - Sets up complete retrieval system
98
- - Initializes GPT-4o-mini LLM
99
- - Tests multi-turn conversation
100
- - Shows source citations
101
- - Demonstrates conversation memory
102
 
103
- **Output**: Complete conversational RAG system test
104
 
105
- #### 2. **Complete Retrieval Setup**
106
- ```bash
107
- # Set up embeddings, vector store, and all retrieval methods
108
- uv run python src/setup_retrieval.py
109
- ```
110
- **What it does**:
111
- - Loads 16 markdown documents
112
- - Creates ~217 optimized chunks
113
- - Initializes Azure OpenAI embeddings
114
- - Creates/loads FAISS vector store
115
- - Tests all retrieval methods (BM25, Similarity, MMR, Hybrid)
116
 
117
- **Output**: Fully initialized retrieval system ready for RAG
118
 
119
- #### 3. **Individual Modules**
120
 
121
- **Load Documents**
122
- ```bash
123
- uv run python src/loader/ingest.py
124
- ```
125
- **Output**: 16 documents with category metadata
126
 
127
- **Chunk Documents**
128
- ```bash
129
- uv run python src/processing/chunking.py
130
- ```
131
- **Output**: ~217 chunks (avg 792 chars)
132
-
133
- **Test Embeddings**
134
- ```bash
135
- uv run python src/embeddings/embeddings.py
136
- ```
137
- **Output**: Embedding model initialization test
138
 
139
- **Test Vector Store**
140
- ```bash
141
- uv run python src/vectorstore/store.py
142
- ```
143
- **Output**: FAISS index creation, save, and load test
144
 
145
- **Test Retrieval**
146
  ```bash
147
- uv run python src/retrieval/retriever.py
148
- ```
149
- **Output**: BM25 retrieval test (no embeddings needed)
150
-
151
- ## Module Documentation
152
-
153
- ### `src/embeddings/embeddings.py`
154
- Manages embedding model initialization with dual provider support.
155
- - **Azure OpenAI**: Primary provider with text-embedding-3-small
156
- - **OpenAI**: Fallback provider
157
- - Environment variable validation
158
- - Test mode for verification
159
-
160
- **Usage**:
161
- ```python
162
- from embeddings.embeddings import get_embeddings
163
-
164
- # Azure (default)
165
- embeddings = get_embeddings(provider="azure")
166
-
167
- # OpenAI fallback
168
- embeddings = get_embeddings(provider="openai")
169
- ```
170
-
171
- ### `src/vectorstore/store.py`
172
- FAISS vector store manager for efficient similarity search.
173
- - Create index from documents
174
- - Save/load to disk
175
- - Incremental document additions
176
- - Multiple search modes (similarity, MMR)
177
- - Convert to retriever interface
178
-
179
- **Key Features**:
180
- - Persistent storage (saves to `vectorstore/faiss_index/`)
181
- - Fast similarity search with FAISS CPU
182
- - MMR support for diverse results
183
- - Seamless integration with UPBRetriever
184
-
185
- **Usage**:
186
- ```python
187
- from vectorstore.store import VectorStoreManager
188
- from embeddings.embeddings import get_embeddings
189
-
190
- embeddings = get_embeddings()
191
- manager = VectorStoreManager(embeddings)
192
-
193
- # Create from documents
194
- manager.create_from_documents(chunks)
195
- manager.save("vectorstore/faiss_index")
196
-
197
- # Load existing
198
- manager.load("vectorstore/faiss_index")
199
-
200
- # Search
201
- results = manager.similarity_search("query", k=4)
202
- ```
203
-
204
- ### `src/loader/ingest.py`
205
- Loads markdown files with automatic category detection based on folder structure.
206
- - Supports progress tracking
207
- - Multithreaded loading
208
- - Metadata enrichment
209
-
210
- ### `src/processing/chunking.py`
211
- Intelligent text splitting with context preservation.
212
- - Paragraph-aware chunking
213
- - Configurable size and overlap
214
- - Preserves all metadata
215
- - Tracks chunk position
216
-
217
- ### `src/retrieval/retriever.py`
218
- Multi-strategy retrieval system with **Reciprocal Rank Fusion (RRF)**.
219
- - **BM25**: Keyword-based sparse retrieval (Okapi BM25)
220
- - **Similarity**: Dense vector search with embeddings
221
- - **MMR**: Maximal Marginal Relevance for diverse results
222
- - **Hybrid**: Ensemble with RRF algorithm (from `langchain-classic`)
223
- - Uses Reciprocal Rank Fusion to intelligently merge BM25 + vector results
224
- - Better than simple concatenation: boosts docs appearing in both retrievers
225
- - Handles different scoring scales and provides better diversity control
226
-
227
- **Why RRF?** Documents that appear in both BM25 and vector search get higher scores,
228
- indicating they're relevant both keyword-wise AND semantically. This produces better
229
- results than either method alone.
230
-
231
- **Usage**:
232
- ```python
233
- from retrieval.retriever import UPBRetriever
234
- from setup_retrieval import setup_retrieval_system
235
-
236
- # Full setup
237
- retriever, vectorstore_manager, chunks = setup_retrieval_system()
238
-
239
- # Different retrieval strategies
240
- query = "ingeniería de sistemas inteligencia artificial"
241
-
242
- # BM25 only (keyword matching)
243
- results = retriever.retrieve(query, method="bm25", k=4)
244
-
245
- # Similarity search (semantic)
246
- results = retriever.retrieve(query, method="similarity", k=4)
247
-
248
- # MMR (diverse results)
249
- results = retriever.retrieve(query, method="mmr", k=4)
250
-
251
- # Hybrid with RRF (recommended)
252
- results = retriever.retrieve(query, method="hybrid", k=4)
253
-
254
- # Custom hybrid weights
255
- results = retriever.retrieve(
256
- query,
257
- method="hybrid",
258
- k=4,
259
- weights=[0.3, 0.7] # [bm25_weight, vector_weight]
260
- )
261
- ```
262
-
263
- ### `src/setup_retrieval.py`
264
- Complete retrieval system initialization and testing.
265
- - One-function setup for entire retrieval pipeline
266
- - Automatic vector store creation/loading
267
- - Multi-method comparison testing
268
- - Production-ready configuration
269
-
270
- **Quick Start**:
271
- ```python
272
- from setup_retrieval import setup_retrieval_system
273
-
274
- # Initialize everything
275
- retriever, vectorstore_manager, chunks = setup_retrieval_system()
276
 
277
- # Ready to use!
278
- results = retriever.retrieve("your query", method="hybrid", k=4)
279
- ```
280
 
281
- ### `src/rag/chain.py`
282
- Conversational RAG chain with GPT-4o-mini and memory.
283
- - Multi-turn conversation support
284
- - Conversation history tracking
285
- - Source citations with document metadata
286
- - Spanish language optimized prompts
287
- - Hybrid retrieval integration
288
-
289
- **Features**:
290
- - Maintains context across multiple questions
291
- - Provides document sources for transparency
292
- - Friendly, professional tone in Spanish
293
- - Suggests related programs when appropriate
294
-
295
- **Usage**:
296
- ```python
297
- from rag.chain import UPBRAGChain
298
- from setup_retrieval import setup_retrieval_system
299
-
300
- # Setup
301
- retriever, _, _ = setup_retrieval_system()
302
- rag_chain = UPBRAGChain(retriever, retrieval_method="hybrid")
303
-
304
- # Ask questions
305
- response = rag_chain.invoke(
306
- "¿Qué carrera debo estudiar si me gusta la IA?",
307
- include_sources=True
308
- )
309
-
310
- print(response['answer'])
311
- for source in response['sources']:
312
- print(f"- {source['category']}: {source['source']}")
313
-
314
- # Continue conversation (memory is maintained)
315
- response2 = rag_chain.invoke("¿Qué requisitos necesito?")
316
-
317
- # Clear history when needed
318
- rag_chain.clear_history()
319
- ```
320
 
321
- # Ready to use!
322
- results = retriever.retrieve("your query", method="hybrid", k=4)
323
  ```
324
 
325
- ### `src/pipeline.py`
326
- Orchestrates the complete data preparation flow.
327
- - One-function interface
328
- - Flexible configuration
329
- - Detailed statistics output
330
 
331
- ## Quick Start Examples
332
-
333
- ### Interactive Chat
334
- ```bash
335
- # Run interactive chat interface
336
- uv run python src/example_usage.py
337
- ```
338
 
339
- Type your questions in Spanish and the assistant will respond using the RAG system. Commands:
340
- - `salir` - Exit the chat
341
- - `limpiar` - Clear conversation history
342
 
343
- ### Programmatic Usage
344
 
345
- **Basic RAG Query**:
346
- ```python
347
- from setup_retrieval import setup_retrieval_system
348
- from rag.chain import UPBRAGChain
349
 
350
- # Initialize
351
- retriever, _, _ = setup_retrieval_system()
352
- rag_chain = UPBRAGChain(retriever, retrieval_method="hybrid")
353
 
354
- # Ask question
355
- response = rag_chain.invoke("¿Qué es la ingeniería de sistemas?")
356
- print(response['answer'])
357
- ```
358
 
359
- **With Source Citations**:
360
- ```python
361
- response = rag_chain.invoke(
362
- "¿Qué becas están disponibles?",
363
- include_sources=True
364
- )
365
-
366
- print(response['answer'])
367
- print("\nFuentes:")
368
- for source in response['sources']:
369
- print(f"- {source['category']}: {source['source']}")
370
- ```
371
 
372
- **Multi-turn Conversation**:
373
- ```python
374
- # First question
375
- r1 = rag_chain.invoke("¿Qué ingenierías tienen?")
376
 
377
- # Follow-up (uses conversation memory)
378
- r2 = rag_chain.invoke("¿Cuál me recomiendas si me gusta programar?")
379
 
380
- # Another follow-up
381
- r3 = rag_chain.invoke("¿Cuánto dura ese programa?")
 
 
382
 
383
- # Clear history when done
384
- rag_chain.clear_history()
385
- ```
386
 
 
 
1
+ ---
2
+ title: UPB Careers Assistant
3
+ emoji: 🎓
4
+ colorFrom: blue
5
+ colorTo: indigo
6
+ sdk: gradio
7
+ sdk_version: 4.44.0
8
+ app_file: app.py
9
+ pinned: false
10
+ license: mit
11
+ python_version: 3.12
12
+ ---
13
+
14
+ # UPB RAG Career Exploration Assistant 🎓
15
+
16
+ An intelligent conversational assistant powered by Retrieval-Augmented Generation (RAG) to help prospective students explore engineering programs at Universidad Pontificia Bolivariana (UPB) in Medellín, Colombia.
17
+
18
+ ## 🚀 Features
19
+
20
+ - **Conversational AI**: Natural Spanish language interactions with GPT-4o-mini
21
+ - **Comprehensive Information**: Covers 12 engineering programs, admissions, scholarships, and contact details
22
+ - **Source Citations**: Transparent answers with document references
23
+ - **Smart Retrieval**: Hybrid search combining BM25 and vector similarity with Reciprocal Rank Fusion
24
+ - **Metadata Index**: Quick access to program lists and accreditation information
25
+
26
+ ## 🎯 What You Can Ask
27
+
28
+ - **Program Information**: "¿Qué ingenierías ofrece la UPB?"
29
+ - **Accreditations**: "¿Cuáles programas tienen acreditación ABET?"
30
+ - **Specific Programs**: "Cuéntame sobre Ingeniería de Sistemas"
31
+ - **Admissions**: "¿Cómo puedo inscribirme?"
32
+ - **Scholarships**: "¿Qué becas están disponibles?"
33
+ - **Contact**: "¿Cómo contacto a la UPB?"
34
+
35
+ ## 🏗️ Architecture
36
+
37
+ ### Tech Stack
38
+ - **LLM**: Azure GPT-4o-mini (temperature=0.0)
39
+ - **Embeddings**: Azure text-embedding-3-small
40
+ - **Vector Store**: FAISS (CPU)
41
+ - **Framework**: LangChain 1.0.2
42
+ - **Retrieval**: Hybrid (BM25 + Vector with RRF)
43
+ - **UI**: Gradio 4.0+
44
+ - **Deployment**: HuggingFace Spaces
45
+
46
+ ### System Components
47
+ 1. **Document Ingestion**: 16 curated markdown files
48
+ 2. **Chunking**: 217 optimized chunks (~792 chars avg)
49
+ 3. **Metadata Index**: JSON-based knowledge graph for quick lookups
50
+ 4. **Multi-Strategy Retrieval**: BM25, Similarity, MMR, Hybrid
51
+ 5. **RAG Chain**: Conversational chain with memory
52
+
53
+ ## 📊 Coverage
54
+
55
+ - **Programs**: 12 engineering programs
56
+ - **Documents**: 16 markdown files
57
+ - **Chunks**: 217 processed fragments
58
+ - **Categories**: Engineering programs, General info, Admissions, Scholarships, Contact
59
+
60
+ ### Engineering Programs Included
61
+
62
+ 1. Ingeniería Administrativa
63
+ 2. Ingeniería Aeronáutica
64
+ 3. Ingeniería Agroindustrial
65
+ 4. Ingeniería en Ciencia de Datos
66
+ 5. Ingeniería Eléctrica
67
+ 6. Ingeniería Electrónica
68
+ 7. Ingeniería en Diseño de Entretenimiento Digital (IDED)
69
+ 8. Ingeniería Industrial
70
+ 9. Ingeniería Mecánica
71
+ 10. Ingeniería en Nanotecnología
72
+ 11. Ingeniería Química
73
+ 12. Ingeniería de Sistemas e Informática
74
+
75
+ ## ⚙️ Configuration
76
+
77
+ ### Environment Variables
78
+
79
+ This app requires the following Azure OpenAI credentials (configured as HuggingFace Secrets):
80
 
81
  ```bash
82
+ AZURE_OPENAI_API_KEY=your_api_key_here
83
  AZURE_OPENAI_ENDPOINT=https://your-endpoint.openai.azure.com/
84
  AZURE_OPENAI_EMBEDDING_DEPLOYMENT=text-embedding-3-small
85
  AZURE_OPENAI_LLM_DEPLOYMENT=gpt-4o-mini
86
  ```
87
 
88
+ ## 🧪 Testing
89
 
90
+ The system includes a comprehensive test suite (`src/comprehensive_test.py`) covering:
91
+ - ✅ Hallucination prevention
92
+ - ✅ Completeness (program listings)
93
+ - ✅ Factual accuracy
94
+ - ✅ Accreditation information
95
 
96
+ Current test results: See `TEST_RESULTS_SUMMARY.md` for detailed analysis.
97
 
98
+ ## 📝 Limitations
99
 
100
+ - Information is based on manually curated documents (October 2025)
101
+ - For specific details on costs, exact dates, or recent changes, contact UPB directly
102
+ - The system may not have information about very specific course details
103
+ - Optimized for Spanish language queries
104
 
105
+ ## 🛠️ Local Development
106
 
107
+ ### Prerequisites
108
+ - Python 3.12
109
+ - UV package manager
110
 
111
+ ### Setup
112
 
 
113
  ```bash
114
+ # Clone repository
115
+ git clone https://github.com/Rosvend/UPB-RAG-Careers.git
116
+ cd UPB-RAG-Careers
117
 
118
+ # Install dependencies
119
+ uv sync
 
120
 
121
+ # Create .env file with Azure credentials
122
+ cp .env.example .env
123
+ # Edit .env with your credentials
124
 
125
+ # Run locally
126
+ uv run python app.py
127
  ```
128
 
129
+ The app will be available at `http://localhost:7860`
130
 
131
+ ## 📚 Documentation
132
 
133
+ - **README.md**: Full technical documentation
134
+ - **TEST_RESULTS_SUMMARY.md**: Comprehensive test analysis
135
+ - **data/**: Source markdown documents
136
 
137
+ ## 🤝 Contributing
138
 
139
+ This is an academic project for the Multimedia Mining course at Universidad Pontificia Bolivariana. Contributions, issues, and feature requests are welcome!
140
 
141
+ ## 📄 License
 
 
142
 
143
+ MIT License - See LICENSE file for details
144
 
145
+ ## 👨‍💻 Author
146
 
147
+ **Rosvend**
148
+ - Universidad Pontificia Bolivariana
149
+ - Multimedia Mining Course, 6th Semester
150
+ - October 2025
151
 
152
+ ## 🙏 Acknowledgments
 
153
 
154
+ - Universidad Pontificia Bolivariana for institutional information
155
+ - LangChain community for the RAG framework
156
+ - HuggingFace for hosting infrastructure
157
+ - Azure OpenAI for AI capabilities
158
 
159
+ ---
 
 
160
 
161
+ **Note**: This assistant provides general information about UPB engineering programs. For official admissions information, dates, costs, and specific curriculum details, please contact UPB directly through their official channels.
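
Reviewer note (not part of the commit): both README versions describe hybrid retrieval that merges BM25 and vector results with Reciprocal Rank Fusion, and the removed version mentions `weights=[0.3, 0.7]`. The sketch below illustrates the general weighted-RRF idea only. It is not the project's `retriever.py`; the function name, the constant `c=60`, and the document file names are assumptions chosen for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, weights=None, c=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores weight / (c + rank) in every list where it
    appears, so documents found by several retrievers accumulate a
    higher total than documents found by only one.
    """
    if weights is None:
        weights = [1.0] * len(rankings)
    if len(weights) != len(rankings):
        raise ValueError("need exactly one weight per ranking")

    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)

    # Highest fused score first; ties are broken by document ID.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from a keyword and a vector retriever.
bm25_hits = ["sistemas.md", "datos.md", "industrial.md"]
vector_hits = ["datos.md", "electronica.md", "sistemas.md"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits], weights=[0.3, 0.7])
print(fused)
```

Because only ranks enter the score, the fusion does not need BM25 scores and cosine similarities to be on a comparable scale, which matches the rationale given in the removed "Why RRF?" paragraph.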