mbudisic committed
Commit ded2340 · Parent: 5c398c0

increased heading level

Files changed (1): ANSWER.md (+87 −88)

Marko Budisic

### Deliverables

1. [Main GitHub repo](https://github.com/mbudisic/pstuts-rag)
1. [GitHub repo for creating the golden dataset](https://github.com/mbudisic/PsTuts-VQA-Data-Operations)
1. [Corpus dataset](https://huggingface.co/datasets/mbudisic/PsTuts-VQA)
1. [Golden Q&A dataset](https://huggingface.co/datasets/mbudisic/pstuts_rag_qa)

### ToC

- [Certification Challenge](#certification-challenge)
- [Deliverables](#deliverables)
- [ToC](#toc)
- [Task 1: Defining your Problem and Audience](#task-1-defining-your-problem-and-audience)
- [Task 2: Propose a Solution](#task-2-propose-a-solution)
- [Task 3: Dealing with the Data](#task-3-dealing-with-the-data)
- [3.1. Data Sources \& External APIs 📊+🌐](#31-data-sources--external-apis-)
- [3.2. Chunking Strategy 🧠✂️](#32-chunking-strategy-️)
- [3.3. Specific Data Needs for Other Parts 🧩](#33-specific-data-needs-for-other-parts-)
- [Task 4: Building a Quick End-to-End Prototype](#task-4-building-a-quick-end-to-end-prototype)
- [4.1. The Prototype Application 🖥️](#41-the-prototype-application-️)
- [4.2. Deployment 🚀 (Hugging Face Space)](#42-deployment--hugging-face-space)
- [Task 5: Creating a Golden Test Data Set](#task-5-creating-a-golden-test-data-set)
- [5.1. RAGAS Framework Assessment \& Results 📊](#51-ragas-framework-assessment--results-)
- [Task 6: Fine-Tuning Open-Source Embeddings](#task-6-fine-tuning-open-source-embeddings)
- [6.1. Fine-Tuning Process and Model Link 🔗](#61-fine-tuning-process-and-model-link-)
- [Task 7: Assessing Performance](#task-7-assessing-performance)
- [7.1. Comparative RAGAS Results 📊](#71-comparative-ragas-results-)
- [8. Future changes](#8-future-changes)

## Task 1: Defining your Problem and Audience

**Problem:** Navigating extensive libraries of video materials to find specific information is often a time-consuming and inefficient process for users. This challenge is common in organizations that rely on video-based training materials. 😓

_Side note: This is a good approximation of a problem that I am internally solving for my company. The agentic RAG will be augmented further for the demo day._

## Task 2: Propose a Solution

**Our Solution:** 🗣️ An agentic Retrieval Augmented Generation (RAG) system designed to answer questions about a company's video tutorial library (e.g., for software like Adobe Photoshop, or any internal training content). Users interact via a chat interface (Chainlit, as seen in `app.py`). 💻 The system queries its knowledge base of tutorial transcripts and can use Tavily for web searches, providing comprehensive answers relevant to the specific video library and serving up videos at the referenced timestamps. 🌐

Broader vision is to build an ingestion pipeline that would transcribe audio narration and use
key frames in the video to further enhance the context.
Users would be able to search not only by a query, but also by a screenshot (e.g. looking up
live video if they have only a screenshot in a company walkthrough).
The agents would not only be able to answer the queries, but also develop a
short presentation, e.g., in `reveal.js` or `remark`.

**The Tech Stack 🛠️:**

- **LLM:** OpenAI model (`gpt-4.1-mini`), selected for strong language capabilities and ease of API access. 🧠
- **Embedding Model:** An open-source model, `Snowflake/snowflake-arctic-embed-s` (see `Fine_Tuning_Embedding_for_PSTuts.ipynb`), fine-tuned for domain-specific relevance. This is a small model trainable on a laptop. ❄️
- **Orchestration:** LangChain & LangGraph, for building the RAG application and managing agent workflows. Many functions live in the `pstuts_rag` package so they can be called from both the notebooks and the app. 🔗
- **Vector Database:** Qdrant (`pstuts_rag/datastore.py`), for efficient semantic search of tutorial transcripts. I had the most experience with it, and no reason to look elsewhere. 💾
- **Evaluation:** A synthetic dataset, [created using RAGAS in a second repo](https://github.com/mbudisic/PsTuts-VQA-Data-Operations), powers `evaluate_rag.ipynb` for assessing the quality of the RAG pipeline (without the web-search component). 🧐
- **Monitoring:** W&B (Weights & Biases) 🏋️ was used to monitor fine-tuning; LangSmith was enabled for general monitoring. 📊
- **User Interface:** Chainlit chat with on-demand display of videos positioned at the correct timestamp. 💬 📼
- **Serving & Inference:** Docker (`Dockerfile`), for containerized deployment on Hugging Face Spaces. 🐳

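For concreteness, here is a minimal sketch of the vector-store side of this stack, assuming the `langchain_qdrant` integration and an in-memory Qdrant instance; the collection name, embedding model, and vector size are illustrative assumptions, not read from `pstuts_rag/datastore.py`.

```python
# Hypothetical sketch of the Qdrant vector-store setup; the actual
# implementation lives in pstuts_rag/datastore.py and may differ.
from langchain_openai import OpenAIEmbeddings
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(":memory:")  # in-process Qdrant, enough for a demo
client.create_collection(
    collection_name="pstuts",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)
store = QdrantVectorStore(
    client=client,
    collection_name="pstuts",
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
)
retriever = store.as_retriever(search_kwargs={"k": 4})
```
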
**The Role of Agents 🕵️‍♂️:**

The system uses a LangGraph-orchestrated multi-agent approach:

1. **Supervisor Agent:** Manages the overall workflow. It receives the user query and routes it to the appropriate specialized agent based on its interpretation of the query (defined in the `SUPERVISOR_SYSTEM` prompt and `create_team_supervisor` in `app.py`). 🧑‍✈️
2. **Video Archive Agent (`VIDEOARCHIVE`):** This is the RAG agent. It queries the Qdrant vector store of Photoshop tutorial transcripts to find relevant information and generates an answer based on this retrieved context. (Uses `create_rag_node` from `pstuts_rag.agent_rag`.) 📼
3. **Adobe Help Agent (`ADOBEHELP`):** This agent uses the Tavily API to perform web searches if the supervisor deems it necessary for broader or more current information. (Uses `create_tavily_node` from `pstuts_rag.agent_tavily`.) 🌍

The supervisor then determines if the answer is complete or if further steps are needed. ✅

(Workflow diagram: the supervisor routes between `VIDEOARCHIVE` and `ADOBEHELP`, then finishes.)

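To make the routing concrete, here is a minimal, hypothetical LangGraph sketch of that loop; the real graph is assembled in `app.py` via `create_team_supervisor` and the node factories in `pstuts_rag`, so everything below beyond the node names is an assumption.

```python
# Hypothetical sketch of the supervisor loop; app.py's create_team_supervisor
# builds the real graph, with an LLM-driven supervisor.
from typing import Annotated, TypedDict

from langgraph.graph import END, START, StateGraph
from langgraph.graph.message import add_messages

class TeamState(TypedDict):
    messages: Annotated[list, add_messages]
    next: str

def supervisor(state: TeamState) -> dict:
    # The real supervisor is an LLM driven by the SUPERVISOR_SYSTEM prompt;
    # here we route to the archive once, then finish.
    answered = any(getattr(m, "type", "") == "ai" for m in state["messages"])
    return {"next": "FINISH" if answered else "VIDEOARCHIVE"}

def video_archive(state: TeamState) -> dict:
    # Stand-in for the RAG node (create_rag_node): retrieve from Qdrant, answer.
    return {"messages": [("ai", "answer grounded in tutorial transcripts")]}

def adobe_help(state: TeamState) -> dict:
    # Stand-in for the Tavily web-search node (create_tavily_node).
    return {"messages": [("ai", "answer from helpx.adobe.com search")]}

builder = StateGraph(TeamState)
builder.add_node("supervisor", supervisor)
builder.add_node("VIDEOARCHIVE", video_archive)
builder.add_node("ADOBEHELP", adobe_help)
builder.add_edge(START, "supervisor")
builder.add_edge("VIDEOARCHIVE", "supervisor")  # agents report back to supervisor
builder.add_edge("ADOBEHELP", "supervisor")
builder.add_conditional_edges(
    "supervisor",
    lambda state: state["next"],
    {"VIDEOARCHIVE": "VIDEOARCHIVE", "ADOBEHELP": "ADOBEHELP", "FINISH": END},
)
graph = builder.compile()
```
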
## Task 3: Dealing with the Data

### 3.1. Data Sources & External APIs 📊+🌐

- **Primary Data:** [PsTuts-VQA](https://github.com/adobe-research/PsTuts-VQA-Dataset) is a publicly released set of transcripts linked to a database of Adobe-created Photoshop training videos. Data is in JSON format, made available on [hf.co:mbudisic/PsTuts-VQA](https://huggingface.co/datasets/mbudisic/PsTuts-VQA). 📝
- **External API:** The Tavily Search API (configured in `app.py`) augments knowledge with web-search results from the domain [helpx.adobe.com](https://helpx.adobe.com) via the `ADOBEHELP` agent, for current or broader topics not covered in the internal videos. 🔍

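As an illustration, a domain-restricted Tavily tool might be built as below; this is a sketch assuming a recent `langchain_community` release where `TavilySearchResults` accepts `include_domains`, and the parameter values are not taken from `app.py`.

```python
# Hypothetical sketch of a domain-restricted Tavily search tool; app.py's
# actual configuration may differ. Requires TAVILY_API_KEY in the environment.
from langchain_community.tools.tavily_search import TavilySearchResults

adobe_help_tool = TavilySearchResults(
    max_results=3,
    include_domains=["helpx.adobe.com"],  # keep results on Adobe's help site
)
results = adobe_help_tool.invoke({"query": "How do I add a layer mask?"})
```
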
### 3.2. Chunking Strategy 🧠✂️

(see: `pstuts_rag/datastore.py`'s `chunk_transcripts` function and `pstuts_rag/loader.py`)

Raw transcript segments are individual sentences, since they are tied to the time windows in which a particular transcript sentence would
be overlaid on the screen.

Therefore, to achieve useful chunking for RAG, the following **semantic chunking** strategy is employed:

1. **Initial Loading:** Transcripts are loaded both entirely per video (`VideoTranscriptBulkLoader`) and as individual sentences/segments with timestamps (`VideoTranscriptChunkLoader`).
2. **Semantic Splitting:** `SemanticChunker` (LangChain, using `OpenAIEmbeddings`) splits full transcripts into semantically coherent chunks.
3. **Metadata Enrichment:** These semantic chunks are enriched with start/end times by mapping them back to the original timestamped sentences.

**In summary:** 🤔 This method (a) creates topically focused chunks for better retrieval 🎯 and (b) links each chunk back to video timestamps. 🔗 A sketch of the pipeline follows.

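The sketch below illustrates the strategy, assuming LangChain's experimental `SemanticChunker`; the real logic is `chunk_transcripts` in `pstuts_rag/datastore.py`, and the helper function, its arguments, and the sentence schema here are hypothetical.

```python
# Illustrative sketch of the semantic chunking strategy; the actual
# implementation is chunk_transcripts in pstuts_rag/datastore.py.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

def chunk_transcript(full_text: str, sentences: list[dict]) -> list[dict]:
    """sentences: [{'text': ..., 'begin': seconds, 'end': seconds}, ...]"""
    splitter = SemanticChunker(OpenAIEmbeddings())
    chunks = []
    for doc in splitter.create_documents([full_text]):
        # Map the chunk back onto the timestamped sentences it contains.
        covered = [s for s in sentences if s["text"] in doc.page_content]
        chunks.append(
            {
                "text": doc.page_content,
                "start": min(s["begin"] for s in covered) if covered else None,
                "end": max(s["end"] for s in covered) if covered else None,
            }
        )
    return chunks
```
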
### 3.3. Specific Data Needs for Other Parts 🧩

- **Evaluation & Golden Dataset (Tasks 5 & 7):** 🏆 Generating the "Golden Data Set" with RAGAS's knowledge-graph-based generation, which produces question-answer-context triplets, is detailed in `create_golden_dataset.ipynb` (see the [`PsTuts-VQA-Data-Operations` repo](https://github.com/mbudisic/PsTuts-VQA-Data-Operations)). The resulting dataset, [hf.co:mbudisic/pstuts_rag_qa](https://huggingface.co/datasets/mbudisic/pstuts_rag_qa), is used to benchmark the RAG pipeline in `evaluate_rag.ipynb` and to fine-tune the embedding model. 📊 A loading sketch follows this list.

- **Embedding Model Fine-Tuning (Task 6):** 🔬 The `Fine_Tuning_Embedding_for_PSTuts.ipynb` notebook shows the use of [`hf.co:mbudisic/pstuts_rag_qa`](https://huggingface.co/datasets/mbudisic/pstuts_rag_qa) to fine-tune the embedding model. This adapts models like `Snowflake/snowflake-arctic-embed-s` for improved retrieval. ⚙️

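For orientation, a minimal sketch of pulling the golden dataset from the Hub with the `datasets` library; the split and column layout printed here is whatever the repo provides, and nothing is assumed from the notebooks.

```python
# Minimal sketch: load the golden Q&A dataset from the Hugging Face Hub.
from datasets import load_dataset

golden = load_dataset("mbudisic/pstuts_rag_qa")
print(golden)                  # shows the available splits and columns
first_split = next(iter(golden))
print(golden[first_split][0])  # one question-answer-context record
```
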
## Task 4: Building a Quick End-to-End Prototype

An end-to-end prototype RAG system for Photoshop tutorials is built and deployed to HF.

### 4.1. The Prototype Application 🖥️

The `app.py` script is the core prototype. It uses Chainlit for the UI, LangChain/LangGraph for orchestration, Qdrant for the vector store, and OpenAI models for embeddings and generation. It loads data, builds the RAG chain, and manages the agentic workflow for user queries. ✨

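For a flavor of the Chainlit side, a minimal handler sketch is below; the actual `app.py` wires the LangGraph workflow into this callback and attaches video elements, so the placeholder logic here is purely illustrative.

```python
# Minimal Chainlit sketch; app.py's real handler invokes the agent graph and
# serves videos positioned at the referenced timestamps.
import chainlit as cl

@cl.on_message
async def on_message(message: cl.Message):
    # In the real app: result = agent_graph.invoke({"messages": [...]})
    answer = f"(placeholder) You asked: {message.content}"
    await cl.Message(content=answer).send()
```
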
### 4.2. Deployment 🚀 (Hugging Face Space)

The repository is structured for Hugging Face Space deployment:

- `README.md` contains Hugging Face Space metadata (e.g., `sdk: docker`).
- A `Dockerfile` enables containerization for deployment.

This setup indicates the prototype is packaged for public deployment. 🌍

## Task 5: Creating a Golden Test Data Set

The creation of the "Golden Test Data Set" is documented in the `create_golden_dataset.ipynb` notebook in the [`PsTuts-VQA-Data-Operations` repository](https://github.com/mbudisic/PsTuts-VQA-Data-Operations). This dataset was then used in `notebooks/evaluate_rag.ipynb` of the current project to assess the initial RAG pipeline with RAGAS. 🌟

### 5.1. RAGAS Framework Assessment & Results 📊

The initial RAG pipeline (the "Base" model, `Snowflake/snowflake-arctic-embed-s` before fine-tuning) yielded these mean RAGAS scores:

| Metric                        | Mean Score |
|-------------------------------|------------|
| Answer Relevancy              | 0.914      |
| Faithfulness                  | 0.721      |
| Context Recall                | 0.672      |
| Factual Correctness (mode=f1) | 0.654      |
| Context Entity Recall         | 0.636      |

_(Scores from `notebooks/evaluate_rag.ipynb` output for the "Base" configuration)_

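For reference, a hedged sketch of how such a RAGAS run is typically set up; `evaluate_rag.ipynb` is the source of truth, and the dataset columns below follow the standard RAGAS schema rather than the notebook's actual variables.

```python
# Hedged sketch of a RAGAS evaluation; needs OPENAI_API_KEY for the default
# judge model. Column names are the standard RAGAS schema, assumed here.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_recall, faithfulness

eval_ds = Dataset.from_dict(
    {
        "question": ["How do I add a layer mask?"],
        "answer": ["Click the Add Layer Mask button in the Layers panel."],
        "contexts": [["To add a layer mask, select the layer and click ..."]],
        "ground_truth": ["Use the Add Layer Mask button in the Layers panel."],
    }
)
report = evaluate(eval_ds, metrics=[faithfulness, answer_relevancy, context_recall])
print(report)  # mean scores per metric
```
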
**2. Conclusions on Performance and Effectiveness 🧐:**

- **Strengths:** 💪 High **Answer Relevancy (0.914)** indicates the system understands queries well.
- **Areas for Improvement:** 📉
  - **Faithfulness (0.721):** Answers are not always perfectly grounded in the retrieved context. Turning the generation temperature down to 0 might have raised this score.
  - **Context Recall (0.672):** Not all necessary information is always retrieved.
  - **Factual Correctness (0.654):** Factual accuracy of answers needs improvement.
- **Overall:** The baseline system is good at producing relevant responses but needs better context retrieval and factual accuracy. This benchmark charts a clear path for improvements, such as embedding fine-tuning. 🛠️

## Task 6: Fine-Tuning Open-Source Embeddings

To enhance retrieval performance for a specific video library, an open-source embedding model can be fine-tuned on domain-specific data. The following describes an example of this process using Photoshop tutorial data. 🧪

### 6.1. Fine-Tuning Process and Model Link 🔗

- **Base Model:** `Snowflake/snowflake-arctic-embed-s` was chosen as the base model for fine-tuning in this example. The `-s` stands for small: the two larger models ended up taking too much GPU memory on my laptop. ❄️
- **Fine-tuning Data:** The fine-tuning notebook is `notebooks/Fine_Tuning_Embedding_for_PSTuts.ipynb`. It uses the golden dataset, retrieved from the HF repository. 🖼️ The data was split into `train`-`validate`-`test` blocks; `train` was used to compute the objective function in the training loop, while `validate` was used in evaluation.
- **Monitoring:** 🛠️ W&B tracked the process and evaluation. 📈
- **Resulting Model:** The fine-tuned model (for the Photoshop example) was saved and pushed to the Hugging Face Hub. 🤗 [mbudisic/snowflake-arctic-embed-s-ft-pstuts](https://huggingface.co/mbudisic/snowflake-arctic-embed-s-ft-pstuts)

_(Evidence for this is in `notebooks/Fine_Tuning_Embedding_for_PSTuts.ipynb`, specifically the `model.push_to_hub` call and its output. The `app.py` can be, and in the live demo is, configured to use such a fine-tuned model for the embedding step in the RAG pipeline.)_

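In outline, such a fine-tuning run with `sentence-transformers` might look like the sketch below; the loss choice, batch size, and epoch count are assumptions, not read out of the notebook.

```python
# Hedged sketch of embedding fine-tuning with sentence-transformers;
# Fine_Tuning_Embedding_for_PSTuts.ipynb is authoritative.
from sentence_transformers import InputExample, SentenceTransformer, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-s")

# (question, relevant context) pairs drawn from the golden dataset's train split.
train_examples = [
    InputExample(texts=["How do I add a layer mask?", "To add a layer mask ..."]),
    # ... one example per Q&A pair
]
loader = DataLoader(train_examples, shuffle=True, batch_size=16)
loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives

model.fit(train_objectives=[(loader, loss)], epochs=3, warmup_steps=50)
model.push_to_hub("mbudisic/snowflake-arctic-embed-s-ft-pstuts")
```
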
## Task 7: Assessing Performance

Performance of the RAG application with the fine-tuned embedding model (`mbudisic/snowflake-arctic-embed-s-ft-pstuts`) was assessed using the same RAGAS framework and "Golden Data Set" (`golden_small_hf`) as the baseline. 🏆

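Swapping the fine-tuned model into the pipeline is then a one-line change on the embeddings side; a sketch assuming the `langchain_huggingface` integration, which is not necessarily how `app.py` wires it:

```python
# Sketch: point the RAG pipeline's embedding step at the fine-tuned model.
# Assumes the langchain_huggingface integration; app.py may differ.
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="mbudisic/snowflake-arctic-embed-s-ft-pstuts"
)
```
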
### 7.1. Comparative RAGAS Results 📊

(see: `notebooks/evaluate_rag.ipynb` output)

The base embedding model already retrieved appropriate context, and fine-tuning did not bring much benefit.

The Hugging Face live demo runs the fine-tuned model.

_(Note: These are mean scores. `Factual Correctness` is `factual_correctness(mode=f1)` in the notebook.)_

## 8. Future changes

**Expected Changes & Future Improvements:**

1. **Re-evaluate Fine-Tuning Strategy: 🤔** Given the results, embedding fine-tuning needs review. This could involve:
   - Augmenting the fine-tuning dataset or using different data generation strategies.
   - Changing the semantic chunking strategy to produce more targeted context, which may be especially important on edge devices. This could in turn increase the importance of fine-tuning.
2. **Prompt Engineering: ✍️** Refine LLM agent prompts (supervisor, RAG) for better answer synthesis. This could boost factual correctness and relevancy, regardless of the embedding model.
3. **Advanced RAG Techniques: ✨** Explore methods like re-ranking, query transformations, or HyDE. The goal is to improve context quality and relevance for the LLM.
4. **LLM for Generation: 🧠** Experiment with different LLMs for answer generation. `evaluate_rag.ipynb` uses `gpt-4.1-nano` (RAG, for efficiency) and `gpt-4.1-mini` (evaluator); `app.py` uses `gpt-4.1-mini`. Consistency or a more powerful model might improve results.
5. **A more complex agent team.** Possibilities:
   - An LLM that writes queries for tools based on previous messages.
   - A writing team that can develop a presentation based on the produced research results.
   - A "highlighter" that can identify the object of discussion in the frame and circle it.
6. **A more complex ingestion pipeline** that is able to transcribe and OCR videos even when they are not accompanied by transcripts.