mbudisic committed on
Commit
4b5524f
·
1 Parent(s): cd9b685

Updated version of the answer

  1. ANSWER.md +114 -60
ANSWER.md CHANGED
@@ -1,77 +1,125 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
  # Task 1: Defining your Problem and Audience
2
 
3
- **Problem:** Locating specific Photoshop information in long video tutorial transcripts is difficult and time-consuming.
 
 
4
 
5
- **Users and their Problem:** Photoshop learners (designers, photographers, students, hobbyists) often struggle with inefficiently searching video tutorials for specific techniques. They need a quick way to query tutorial content for direct, concise answers, saving time and reducing learning frustration.
6
 
7
  # Task 2: Propose a Solution
8
 
9
- **Our Solution:** An agentic Retrieval Augmented Generation (RAG) system answers Adobe Photoshop questions. Users interact via a chat interface (Chainlit, as seen in `app.py`). The system queries its tutorial transcript knowledge base and can use Tavily for web searches, providing comprehensive answers.
10
 
11
- **The Tech Stack 🛠️:** (Primary sources: `app.py`, `pstuts_rag/datastore.py`, `pyproject.toml`, `README.md`)
 
 
 
 
 
12
 
13
- * **LLM:** OpenAI model (e.g., `gpt-4.1-mini` in `app.py`), selected for strong language capabilities.
14
- * **Embedding Model:** An open-source model, `Snowflake/snowflake-arctic-embed-s` (see `Fine_Tuning_Embedding_for_PSTuts.ipynb`), fine-tuned for domain-specific relevance.
15
- * **Orchestration:** LangChain & LangGraph (`app.py`), for building the RAG application and managing agent workflows.
16
- * **Vector Database:** Qdrant (`pstuts_rag/datastore.py`), for efficient semantic search of tutorial transcripts.
17
- * **Monitoring:** W&B (Weights & Biases) is present in `notebooks/` and `Fine_Tuning_Embedding_for_PSTuts.ipynb`, used for experiment tracking during development.
18
- * **Evaluation:** RAGAS (`evaluate_rag.ipynb`, `pyproject.toml`), for assessing RAG pipeline quality.
19
- * **User Interface:** Chainlit (`app.py`, `chainlit.md`), for creating the interactive chat application.
20
- * **Serving & Inference:** Docker (`Dockerfile`), for containerized deployment (e.g., on Hugging Face Spaces, as suggested in `README.md` metadata).
21
 
22
- **The Role of Agents 🕵️‍♂️:** (Primary source: `app.py`)
 
 
 
 
 
 
 
 
 
23
 
24
  The system uses a LangGraph-orchestrated multi-agent approach:
25
- 1. **Supervisor Agent:** Manages the overall workflow. It receives the user query and routes it to the appropriate specialized agent based on its interpretation of the query (defined in `SUPERVISOR_SYSTEM` prompt and `create_team_supervisor` in `app.py`).
26
- 2. **Video Archive Agent (`VIDEOARCHIVE`):** This is the RAG agent. It queries the Qdrant vector store of Photoshop tutorial transcripts to find relevant information and generates an answer based on this retrieved context. (Uses `create_rag_node` from `pstuts_rag.agent_rag`).
27
- 3. **Adobe Help Agent (`ADOBEHELP`):** This agent uses the Tavily API to perform web searches if the supervisor deems it necessary for broader or more current information. (Uses `create_tavily_node` from `pstuts_rag.agent_tavily`).
28
- The supervisor then determines if the answer is complete or if further steps are needed.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
29
 
30
  # Task 3: Dealing with the Data
31
 
32
- Our Photoshop RAG system uses specific data and chunking for accurate, relevant answers.
 
 
 
33
 
34
- **1. Data Sources & External APIs 📊+🌐:**
35
 
36
- * **Primary Data Source:** JSON transcript files from Photoshop video tutorials (e.g., `data/dev.json`, loaded in `app.py`). *Purpose:* Core knowledge base, processed and indexed in Qdrant for semantic search.
37
- * **External API:** Tavily Search API (configured in `app.py`). *Purpose:* Augments knowledge with web search results via the `ADOBEHELP` agent for current or broader topics.
38
 
39
- **2. Default Chunking Strategy 🧠✂️:** (Source: `pstuts_rag/datastore.py`'s `chunk_transcripts` function)
 
 
40
 
41
- A **semantic chunking** strategy is employed:
42
  1. **Initial Loading:** Transcripts are loaded both entirely per video (`VideoTranscriptBulkLoader`) and as individual sentences/segments with timestamps (`VideoTranscriptChunkLoader`).
43
  2. **Semantic Splitting:** `SemanticChunker` (LangChain, using `OpenAIEmbeddings`) splits full transcripts into semantically coherent chunks.
44
  3. **Metadata Enrichment:** These semantic chunks are enriched with start/end times by mapping them back to the original timestamped sentences.
45
 
46
- * **Why this Strategy?** Ensures topically focused chunks for better retrieval relevance, provides richer context to the LLM, and allows linking back to video timestamps.
 
 
47
 
48
- **3. [Optional] Specific Data Needs for Other Parts 🧩:**
49
 
50
- * **Embedding Model Fine-Tuning (Task 6):** The `Fine_Tuning_Embedding_for_PSTuts.ipynb` notebook generated/used a question-passage dataset from Photoshop tutorials (detailed in `dataset_card.md`) to adapt the `Snowflake/snowflake-arctic-embed-s` model for better Photoshop-specific retrieval.
51
- * **Evaluation & Golden Dataset (Tasks 5 & 7):** The process for generating the "Golden Data Set" (question-context-answer triplets) used for RAGAS evaluation is detailed in the `create_golden_dataset.ipynb` notebook within the `PsTuts-VQA-Data-Operations` repository ([https://github.com/mbudisic/PsTuts-VQA-Data-Operations](https://github.com/mbudisic/PsTuts-VQA-Data-Operations)). This dataset, subsequently referred to as `golden_small_hf` on Hugging Face, was then used in the main project's `evaluate_rag.ipynb` for benchmarking.
52
 
53
  # Task 4: Building a Quick End-to-End Prototype
54
 
55
- An end-to-end prototype RAG system for Photoshop tutorials is built and deployable.
56
 
57
- **1. The Prototype Application 🖥️:** (Source: `app.py`)
58
 
59
- The `app.py` script is the core prototype. It uses Chainlit for the UI, LangChain/LangGraph for orchestration, Qdrant for the vector store, and OpenAI models for embeddings and generation. It loads data, builds the RAG chain, and manages the agentic workflow for user queries.
60
 
61
- **2. Deployment 🚀 (Hugging Face Space):**
62
 
63
  The repository is structured for Hugging Face Space deployment:
64
  * `README.md` contains Hugging Face Space metadata (e.g., `sdk: docker`).
65
  * A `Dockerfile` enables containerization for deployment.
66
- This setup indicates the prototype is packaged for public deployment.
67
 
68
  # Task 5: Creating a Golden Test Data Set
69
 
70
- The creation of the "Golden Test Data Set" is documented in the `create_golden_dataset.ipynb` notebook in the `PsTuts-VQA-Data-Operations` repository ([https://github.com/mbudisic/PsTuts-VQA-Data-Operations](https://github.com/mbudisic/PsTuts-VQA-Data-Operations)). This dataset (named `golden_small_hf` on Hugging Face) was then utilized in the `notebooks/evaluate_rag.ipynb` of the current project to assess the initial RAG pipeline with RAGAS.
71
 
72
- **1. RAGAS Framework Assessment & Results 📊:**
73
 
74
- The initial RAG pipeline ("Base" model, likely `text-embedding-3-small` before fine-tuning) yielded these mean RAGAS scores:
75
 
76
  | Metric | Mean Score |
77
  |---------------------------------|------------|
@@ -85,33 +133,35 @@ The initial RAG pipeline ("Base" model, likely `text-embedding-3-small` before f
85
 
86
  **2. Conclusions on Performance and Effectiveness 🧐:**
87
 
88
- * **Strengths:** High **Answer Relevancy (0.914)** indicates the system understands queries well.
89
- * **Areas for Improvement:**
90
  * **Faithfulness (0.721):** Answers are not always perfectly grounded in retrieved context.
91
  * **Context Recall (0.672):** Not all necessary information is always retrieved.
92
  * **Factual Correctness (0.654):** Factual accuracy of answers needs improvement.
93
- * **Overall:** The baseline system is good at relevant responses but needs better context retrieval and factual accuracy. This benchmarks a clear path for improvements, such as embedding fine-tuning.
94
 
95
  # Task 6: Fine-Tuning Open-Source Embeddings
96
 
97
- To enhance retrieval performance, an open-source embedding model was fine-tuned on domain-specific data.
98
 
99
- **1. Fine-Tuning Process and Model Link 🔗:**
100
 
101
- * **Base Model:** `Snowflake/snowflake-arctic-embed-s` was chosen as the base model for fine-tuning.
102
- * **Fine-tuning Data:** A specialized dataset of (question, relevant_document_passage) pairs derived from the Photoshop tutorials was generated/used, as detailed in `dataset_card.md` and implemented in `notebooks/Fine_Tuning_Embedding_for_PSTuts.ipynb`.
103
- * **Process:** The fine-tuning was performed using the `sentence-transformers` library, with training objectives designed to improve the model's ability to map Photoshop-related queries to relevant transcript passages. The process and evaluation were tracked using W&B.
104
- * **Resulting Model:** The fine-tuned model was saved and pushed to the Hugging Face Hub.
105
- * **Hugging Face Hub Link:** The fine-tuned embedding model is available at:
106
  [mbudisic/snowflake-arctic-embed-s-ft-pstuts](https://huggingface.co/mbudisic/snowflake-arctic-embed-s-ft-pstuts)
107
 
108
- *(Evidence for this is in `notebooks/Fine_Tuning_Embedding_for_PSTuts.ipynb`, specifically the `model.push_to_hub` call and its output. The `app.py` can be (or is) configured to use this fine-tuned model for the embedding step in the RAG pipeline.)*
109
 
110
  # Task 7: Assessing Performance
111
 
112
- Performance of the RAG application with the fine-tuned embedding model (`mbudisic/snowflake-arctic-embed-s-ft-pstuts`) was assessed using the same RAGAS framework and "Golden Data Set" (`golden_small_hf`) as the baseline.
113
 
114
- **1. Comparative RAGAS Results 📊:** (Source: `notebooks/evaluate_rag.ipynb` output)
 
 
115
 
116
  The notebook provides a comparison between "Base", "SOTA" (OpenAI's `text-embedding-3-small`), and "FT" (our fine-tuned `mbudisic/snowflake-arctic-embed-s-ft-pstuts`) models.
117
 
@@ -125,22 +175,26 @@ The notebook provides a comparison between "Base", "SOTA" (OpenAI's `text-embedd
125
 
126
  *(Note: These are mean scores. `Factual Correctness` is `factual_correctness(mode=f1)` in the notebook.)*
127
 
128
- **2. Conclusions on Fine-Tuned Performance & Future Changes 🚀:**
129
 
130
  * **Impact of Fine-Tuning:**
131
- * **Faithfulness (+0.027):** A slight improvement, suggesting answers from the fine-tuned model are marginally more grounded in the retrieved context.
132
- * **Answer Relevancy (-0.095):** Surprisingly, answer relevancy decreased. This might indicate that while the fine-tuned model is better at finding *technically* similar content based on Photoshop jargon, the overall answer framing by the LLM became less aligned with the user's original question intent compared to the broader base model.
133
- * **Context Recall (No Change):** The ability to retrieve all necessary information did not change. The notebook itself notes: "What we see is that there is no difference in context recall... My guess is that this result has to do with the specific application. These were audio transcripts of fairly short videos. Most transcripts therefore fit completely into a single, or a few, chunks... even a base embedding model likely did as good of a job as it could."
134
- * **Factual Correctness (-0.056):** This also saw a decrease, which is concerning and counter-intuitive for a fine-tuning step aimed at domain specificity.
135
- * **Overall Assessment of Fine-Tuning:** The fine-tuning of `Snowflake/snowflake-arctic-embed-s` showed mixed results. While faithfulness slightly improved, the key metrics of answer relevancy and factual correctness unexpectedly declined. Context recall remained unchanged, which the notebook speculates might be due to the nature of the data (short, distinct transcripts). The notebook author concludes: "So, in the end, the conclusion is that the embedding model is not the right spot to optimize this RAG chain." for this specific dataset and base embedding model.
 
 
 
 
136
  * **Expected Changes & Future Improvements:**
137
- 1. **Re-evaluate Fine-Tuning Strategy:** Given the results, the fine-tuning approach for embeddings needs review. This could involve:
138
- * Trying a different base model for fine-tuning (perhaps a larger one, or one known for better transfer learning on smaller datasets).
139
  * Augmenting the fine-tuning dataset or using different data generation strategies.
140
  * Adjusting fine-tuning hyperparameters.
141
- 2. **Prompt Engineering:** Focus on refining the prompts used for the LLM agents (supervisor, RAG agent) to better guide answer synthesis, potentially improving factual correctness and answer relevancy irrespective of embedding model changes.
142
- 3. **Advanced RAG Techniques:** Explore techniques like re-ranking retrieved documents, query transformations, or hypothetical document embeddings (HyDE) to improve the quality and relevance of context fed to the LLM.
143
- 4. **LLM for Generation:** Experiment with different LLMs for the answer generation step. The `evaluate_rag.ipynb` uses `gpt-4.1-nano` for the LLM in RAG chains and `gpt-4.1-mini` for the evaluator LLM. The main `app.py` uses `gpt-4.1-mini`. Consistency or using a more powerful generation model might yield better results.
144
- 5. **Iterative Evaluation:** Continue using the RAGAS framework on the golden dataset to meticulously track the impact of each change.
145
 
146
  This concludes the update to `ANSWER.md` based on your instructions.
 
1
+ # Certification Challenge
2
+
3
+ Marko Budisic
4
+
5
+ ## Deliverables:
6
+
7
+ 1. [Main Github repo]
8
+ 2. [Github repo for creating the golden dataset](https://github.com/adobe-research/PsTuts-VQA-Dataset)
9
+ 3. [Loom video]()
10
+ 4. [Written document](https://github.com/mbudisic/pstuts-rag/blob/main/ANSWER.md)
11
+ 5. [Hugging Face live demo](https://huggingface.co/spaces/mbudisic/PsTuts-RAG)
12
+ 6. [Fine tuned embedding model](https://huggingface.co/mbudisic/snowflake-arctic-embed-s-ft-pstuts)
13
+ 7. [Corpus dataset](https://huggingface.co/datasets/mbudisic/PsTuts-VQA)
14
+ 8. [Golden Q&A dataset](https://huggingface.co/datasets/mbudisic/pstuts_rag_qa)
15
+
16
+
17
  # Task 1: Defining your Problem and Audience
18
 
19
+ **Problem:** Navigating extensive libraries of video materials to find specific information is often a time-consuming and inefficient process for users. This challenge is common in organizations that rely on video-based training materials. 😓
20
+
21
+ **Users and their Problem:** 🏢 Companies often have extensive video tutorial libraries for proprietary software. Employees (new hires, support, experienced users) struggle to quickly find specific instructions within these videos. 🎯 Like Photoshop learners needing a specific technique, employees need a fast way to query video content, saving time and boosting learning. 🚀
22
 
23
+ _Side note: This is a good approximation of a problem that I am internally solving for my company. The agentic RAG will be augmented further for the demo day._
24
 
25
  # Task 2: Propose a Solution
26
 
27
+ **Our Solution:** 🗣️ An agentic Retrieval Augmented Generation (RAG) system designed to answer questions about a company's video tutorial library (e.g., for software like Adobe Photoshop, or any internal training content). Users interact via a chat interface (Chainlit, as seen in `app.py`). 💻 The system queries its knowledge base of tutorial transcripts and can use Tavily for web searches, providing comprehensive answers relevant to the specific video library and serving up videos at the referenced timestamps. 🌐
28
 
29
+ The broader vision is to build an ingestion pipeline that would transcribe audio narration and OCR
30
+ key frames in the video to further enhance the context.
31
+ Users would be able to search not only by a query, but also by a screenshot (e.g. looking up
32
+ live video if they have only a screenshot in a company walkthrough).
33
+ The agents would not only be able to answer the queries, but also develop a
34
+ short presentation, e.g., in `reveal.js` or `remark`.
35
 
36
+ **The Tech Stack 🛠️:**
 
 
 
 
 
 
 
37
 
38
+ * **LLM:** OpenAI model (`gpt-4.1-mini`), selected for strong language capabilities and ease of API access. 🧠
39
+ * **Embedding Model:** An open-source model, `Snowflake/snowflake-arctic-embed-s` (see `Fine_Tuning_Embedding_for_PSTuts.ipynb`), fine-tuned for domain-specific relevance. This is a small model trainable on a laptop. ❄️
40
+ * **Orchestration:** LangChain & LangGraph, for building the RAG application and managing agent workflows. Many functions live in the `pstuts_rag` package so that they can be called from both the notebooks and the app. 🔗
41
+ * **Vector Database:** Qdrant (`pstuts_rag/datastore.py`), for efficient semantic search of tutorial transcripts. I had the most experience with it and saw no reason to look elsewhere. 💾
42
+ * **Evaluation:** A synthetic data set, [created using RAGAS in a second repo](https://github.com/mbudisic/PsTuts-VQA-Data-Operations), powers `evaluate_rag.ipynb`, which assesses the quality of the RAG pipeline (without the web search capability). 🧐
43
+ * **Monitoring:** W&B (Weights & Biases) 🏋️ was used to monitor fine-tuning. LangSmith was enabled for general monitoring. 📊
44
+ * **User Interface:** Chainlit chat with on-demand display of videos positioned at the correct timestamp. 💬 📼
45
+ * **Serving & Inference:** Docker (`Dockerfile`), for containerized deployment on Hugging Face Spaces. 🐳
46
+
47
+ **The Role of Agents 🕵️‍♂️:**
48
 
49
  The system uses a LangGraph-orchestrated multi-agent approach:
50
+ 1. **Supervisor Agent:** Manages the overall workflow. It receives the user query and routes it to the appropriate specialized agent based on its interpretation of the query (defined in `SUPERVISOR_SYSTEM` prompt and `create_team_supervisor` in `app.py`). 🧑‍✈️
51
+ 2. **Video Archive Agent (`VIDEOARCHIVE`):** This is the RAG agent. It queries the Qdrant vector store of Photoshop tutorial transcripts to find relevant information and generates an answer based on this retrieved context. (Uses `create_rag_node` from `pstuts_rag.agent_rag`). 📼
52
+ 3. **Adobe Help Agent (`ADOBEHELP`):** This agent uses the Tavily API to perform web searches if the supervisor deems it necessary for broader or more current information. (Uses `create_tavily_node` from `pstuts_rag.agent_tavily`). 🌍
53
+ The supervisor then determines if the answer is complete or if further steps are needed. ✅
54
+
55
+ ```
56
+ +-----------+
57
+ | __start__ |
58
+ +-----------+
59
+ *
60
+ *
61
+ *
62
+ +------------+
63
+ | supervisor |
64
+ *****+------------+.....
65
+ **** . ....
66
+ ***** . .....
67
+ *** . ...
68
+ +-----------+ +--------------------+ +---------+
69
+ | AdobeHelp | | VideoArchiveSearch | | __end__ |
70
+ +-----------+ +--------------------+ +---------+
71
+ ```
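
The routing loop in the graph above can be illustrated without LangGraph. What follows is a minimal plain-Python sketch, not the code in `app.py`: the keyword rule in `route` stands in for the LLM-driven supervisor, and `State`, `video_archive`, `adobe_help`, `WORKERS`, and `run` are hypothetical names introduced only for illustration.

```python
from typing import Callable, Dict, List, TypedDict


class State(TypedDict):
    query: str
    answers: List[str]
    visited: List[str]


def video_archive(state: State) -> State:
    # Hypothetical stand-in for the RAG agent over tutorial transcripts.
    state["answers"].append(f"[transcripts] {state['query']}")
    state["visited"].append("VideoArchiveSearch")
    return state


def adobe_help(state: State) -> State:
    # Hypothetical stand-in for the web-search agent.
    state["answers"].append(f"[web] {state['query']}")
    state["visited"].append("AdobeHelp")
    return state


WORKERS: Dict[str, Callable[[State], State]] = {
    "VideoArchiveSearch": video_archive,
    "AdobeHelp": adobe_help,
}


def route(state: State) -> str:
    # Toy rule (assumption): the real supervisor asks an LLM to pick the next node.
    if "VideoArchiveSearch" not in state["visited"]:
        return "VideoArchiveSearch"
    if "latest" in state["query"].lower() and "AdobeHelp" not in state["visited"]:
        return "AdobeHelp"
    return "__end__"


def run(query: str) -> State:
    # Supervisor loop: route, run the chosen worker, return to the supervisor.
    state: State = {"query": query, "answers": [], "visited": []}
    while True:
        next_node = route(state)
        if next_node == "__end__":
            return state
        state = WORKERS[next_node](state)
```

Every worker hands control back to the supervisor, which is why the graph has edges from `supervisor` to each agent and to `__end__` but no direct edges between agents.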
72
 
73
  # Task 3: Dealing with the Data
74
 
75
+ ## 3.1. Data Sources & External APIs 📊+🌐:
76
+
77
+ * **Primary Data:** [PsTuts-VQA](https://github.com/adobe-research/PsTuts-VQA-Dataset) is a publicly released set of transcripts linked to a database of Adobe-created Photoshop training videos. The data is in JSON format and is available at [hf.co:mbudisic/PsTuts-VQA](https://huggingface.co/datasets/mbudisic/PsTuts-VQA). 📁
78
+ * **External API:** Tavily Search API (configured in `app.py`) augments knowledge with web search results from the [helpx.adobe.com](https://helpx.adobe.com) domain via the `ADOBEHELP` agent for current or broader topics not covered in the internal videos. 🔍
79
 
80
+ ## 3.2. Chunking Strategy 🧠✂️:
81
 
82
+ (see: `pstuts_rag/datastore.py`'s `chunk_transcripts` function and `pstuts_rag/loader.py`)
 
83
 
84
+ Transcript chunks in the input dataset are too granular - often a sentence or two,
85
+ since they are tied to the time windows in which a particular transcript sentence would
86
+ be overlaid on the screen.
87
 
88
+ Therefore, to produce chunks that are useful for RAG, the following **semantic chunking** strategy is employed:
89
  1. **Initial Loading:** Transcripts are loaded both entirely per video (`VideoTranscriptBulkLoader`) and as individual sentences/segments with timestamps (`VideoTranscriptChunkLoader`).
90
  2. **Semantic Splitting:** `SemanticChunker` (LangChain, using `OpenAIEmbeddings`) splits full transcripts into semantically coherent chunks.
91
  3. **Metadata Enrichment:** These semantic chunks are enriched with start/end times by mapping them back to the original timestamped sentences.
92
 
93
+ **In summary:** 🤔 This method (a) creates topically focused chunks for better retrieval 🎯 and (b) links each chunk back to video timestamps. 🔗
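
The metadata enrichment step can be sketched as follows. This is a simplified illustration, not the logic of `chunk_transcripts`: it assumes each semantic chunk contains its source sentences verbatim, and the `Sentence` class and `chunk_time_span` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Sentence:
    text: str
    start: float  # seconds from the beginning of the video
    end: float


def chunk_time_span(chunk: str, sentences: List[Sentence]) -> Tuple[float, float]:
    """Return (start, end) covering every timestamped sentence found in the chunk."""
    # Assumption: sentences appear verbatim inside the chunk text.
    hits = [s for s in sentences if s.text.strip() and s.text.strip() in chunk]
    if not hits:
        raise ValueError("chunk does not contain any known sentence")
    return min(s.start for s in hits), max(s.end for s in hits)


sentences = [
    Sentence("Open the Layers panel.", 0.0, 2.5),
    Sentence("Click the mask icon.", 2.5, 5.0),
    Sentence("Now paint with black.", 5.0, 8.0),
]
span = chunk_time_span("Open the Layers panel. Click the mask icon.", sentences)
```

Here `span` is `(0.0, 5.0)`, which is enough to position a video player at the start of the chunk. Substring matching would mislabel a sentence that repeats within one transcript, so a production version would likely track sentence order as well.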
94
+
95
+ ## 3.3. Specific Data Needs for Other Parts 🧩:
96
 
97
+ * **Evaluation & Golden Dataset (Tasks 5 & 7):** 🏆 Generating the "Golden Data Set" (Q-C-A triplets) for RAGAS is detailed in `create_golden_dataset.ipynb` (see [`PsTuts-VQA-Data-Operations` repo](https://github.com/mbudisic/PsTuts-VQA-Data-Operations)). The resulting dataset [hf.co:mbudisic/pstuts_rag_qa](https://huggingface.co/datasets/mbudisic/pstuts_rag_qa) is used to benchmark the RAG pipeline in `evaluate_rag.ipynb`. 📊
98
 
99
+ * **Embedding Model Fine-Tuning (Task 6):** 🔬 The `Fine_Tuning_Embedding_for_PSTuts.ipynb` notebook shows the use of [hf.co:mbudisic/pstuts_rag_qa](https://huggingface.co/datasets/mbudisic/pstuts_rag_qa) to fine-tune the embedding model. This adapts models like `Snowflake/snowflake-arctic-embed-s` for improved retrieval. ⚙️
 
100
 
101
  # Task 4: Building a Quick End-to-End Prototype
102
 
103
+ An end-to-end prototype RAG system for Photoshop tutorials is built and deployed to HF.
104
 
105
+ ## 4.1. The Prototype Application 🖥️:
106
 
107
+ The `app.py` script is the core prototype. It uses Chainlit for the UI, LangChain/LangGraph for orchestration, Qdrant for the vector store, and OpenAI models for embeddings and generation. It loads data, builds the RAG chain, and manages the agentic workflow for user queries. ✨
108
 
109
+ ## 4.2. Deployment 🚀 (Hugging Face Space):
110
 
111
  The repository is structured for Hugging Face Space deployment:
112
  * `README.md` contains Hugging Face Space metadata (e.g., `sdk: docker`).
113
  * A `Dockerfile` enables containerization for deployment.
114
+ This setup indicates the prototype is packaged for public deployment. 🌍
115
 
116
  # Task 5: Creating a Golden Test Data Set
117
 
118
+ The creation of the "Golden Test Data Set" is documented in the `create_golden_dataset.ipynb` notebook in the `PsTuts-VQA-Data-Operations` repository ([https://github.com/mbudisic/PsTuts-VQA-Data-Operations](https://github.com/mbudisic/PsTuts-VQA-Data-Operations)). This dataset (named `golden_small_hf` on Hugging Face) was then utilized in the `notebooks/evaluate_rag.ipynb` of the current project to assess the initial RAG pipeline with RAGAS. 🌟
119
 
120
+ ## 5.1. RAGAS Framework Assessment & Results 📊:
121
 
122
+ The initial RAG pipeline ("Base" model, `Snowflake/snowflake-arctic-embed-s` before fine-tuning) yielded these mean RAGAS scores:
123
 
124
  | Metric | Mean Score |
125
  |---------------------------------|------------|
 
133
 
134
  **2. Conclusions on Performance and Effectiveness 🧐:**
135
 
136
+ * **Strengths:** 💪 High **Answer Relevancy (0.914)** indicates the system understands queries well.
137
+ * **Areas for Improvement:** 📉
138
  * **Faithfulness (0.721):** Answers are not always perfectly grounded in retrieved context.
139
  * **Context Recall (0.672):** Not all necessary information is always retrieved.
140
  * **Factual Correctness (0.654):** Factual accuracy of answers needs improvement.
141
+ * **Overall:** The baseline system is good at relevant responses but needs better context retrieval and factual accuracy. This benchmark points to a clear path for improvements, such as embedding fine-tuning. 🛠️
142
 
143
  # Task 6: Fine-Tuning Open-Source Embeddings
144
 
145
+ To enhance retrieval performance for a specific video library, an open-source embedding model can be fine-tuned on domain-specific data. The following describes an example of this process using Photoshop tutorial data. 🧪
146
 
147
+ ## 6.1. Fine-Tuning Process and Model Link 🔗:
148
 
149
+ * **Base Model:** `Snowflake/snowflake-arctic-embed-s` was chosen as the base model for fine-tuning in this example. ❄️
150
+ * **Fine-tuning Data:** For this specific example, a specialized dataset of (question, relevant_document_passage) pairs derived from Photoshop tutorials was generated/used, as detailed in `dataset_card.md` and implemented in `notebooks/Fine_Tuning_Embedding_for_PSTuts.ipynb`. A similar dataset would be created for any other specific domain. 🖼️
151
+ * **Process:** 🛠️ Fine-tuning used `sentence-transformers` to better map domain queries (e.g., Photoshop) to transcript passages. W&B tracked the process and evaluation. 📈
152
+ * **Resulting Model:** The fine-tuned model (for the Photoshop example) was saved and pushed to the Hugging Face Hub. 🤗
153
+ * **Hugging Face Hub Link (Example):** The fine-tuned embedding model for the Photoshop tutorial example is available at:
154
  [mbudisic/snowflake-arctic-embed-s-ft-pstuts](https://huggingface.co/mbudisic/snowflake-arctic-embed-s-ft-pstuts)
155
 
156
+ *(Evidence for this is in `notebooks/Fine_Tuning_Embedding_for_PSTuts.ipynb`, specifically the `model.push_to_hub` call and its output. The `app.py` can be (or is) configured to use such a fine-tuned model for the embedding step in the RAG pipeline.)*
157
 
158
  # Task 7: Assessing Performance
159
 
160
+ Performance of the RAG application with the fine-tuned embedding model (`mbudisic/snowflake-arctic-embed-s-ft-pstuts`) was assessed using the same RAGAS framework and "Golden Data Set" (`golden_small_hf`) as the baseline. 🏆
161
 
162
+ ## 7.1. Comparative RAGAS Results 📊:
163
+
164
+ (see: `notebooks/evaluate_rag.ipynb` output)
165
 
166
  The notebook provides a comparison between "Base", "SOTA" (OpenAI's `text-embedding-3-small`), and "FT" (our fine-tuned `mbudisic/snowflake-arctic-embed-s-ft-pstuts`) models.
167
 
 
175
 
176
  *(Note: These are mean scores. `Factual Correctness` is `factual_correctness(mode=f1)` in the notebook.)*
177
 
178
+ ## 7.2. Conclusions on Fine-Tuned Performance 🚀:
179
 
180
  * **Impact of Fine-Tuning:**
181
+ * **Faithfulness (+0.027):** ✅ A slight improvement, suggesting answers from the fine-tuned model are marginally more grounded in the retrieved context.
182
+ * **Answer Relevancy (-0.095):** 📉 Surprisingly, relevancy decreased. While the FT model found technically similar content (e.g., Photoshop jargon), the LLM's answer framing may have become less aligned with user intent versus the base model.
183
+ * **Context Recall (No Change):** 🤷‍♂️ Retrieval ability remained static. The notebook suggests this might be due to short video transcripts fitting into few chunks, where even base embeddings perform well.
184
+ * **Factual Correctness (-0.056):** 📉 This also saw a decrease, which is concerning and counter-intuitive for a fine-tuning step aimed at domain specificity.
185
+ * **Overall Assessment of Fine-Tuning:** 🤔 Fine-tuning `Snowflake/snowflake-arctic-embed-s` gave mixed results. Faithfulness rose slightly, but answer relevancy and factual correctness surprisingly dropped. Context recall was unchanged (likely due to the nature of the data). The notebook concludes that the embedding model isn't the prime optimization spot here. 🎯
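
The deltas above can be turned back into absolute scores with simple arithmetic. The sketch below assumes that the Task 5 baseline means are the same "Base" scores the deltas are measured against, and that "No Change" means a delta of exactly zero at this precision. The resulting values are derived here, not read from `evaluate_rag.ipynb`, and the names `baseline`, `delta`, and `apply_deltas` are illustrative.

```python
from typing import Dict

# Baseline means quoted in Task 5 (assumed to be the "Base" row of the comparison).
baseline: Dict[str, float] = {
    "faithfulness": 0.721,
    "answer_relevancy": 0.914,
    "context_recall": 0.672,
    "factual_correctness": 0.654,
}

# Changes after fine-tuning, as quoted in the list above.
delta: Dict[str, float] = {
    "faithfulness": 0.027,
    "answer_relevancy": -0.095,
    "context_recall": 0.0,
    "factual_correctness": -0.056,
}


def apply_deltas(base: Dict[str, float], change: Dict[str, float]) -> Dict[str, float]:
    # Round to three decimals to match the precision of the reported scores.
    return {name: round(base[name] + change[name], 3) for name in base}


fine_tuned = apply_deltas(baseline, delta)
for name, score in fine_tuned.items():
    print(f"{name}: {baseline[name]:.3f} -> {score:.3f}")
```

Under these assumptions the fine-tuned model loses roughly a tenth of its answer relevancy while faithfulness moves by under 0.03, which matches the "mixed results" reading.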
186
+ *
187
+
188
+ # 8. Future changes
189
+
190
  * **Expected Changes & Future Improvements:**
191
+ 1. **Re-evaluate Fine-Tuning Strategy: 🤔** Given these results, the embedding fine-tuning approach needs review. This could involve:
192
+ * Trying a different base model (larger, better transfer learning on small datasets).
193
  * Augmenting the fine-tuning dataset or using different data generation strategies.
194
  * Adjusting fine-tuning hyperparameters.
195
+ 2. **Prompt Engineering: ✍️** Refine LLM agent prompts (supervisor, RAG) for better answer synthesis. This could boost factual correctness and relevancy, regardless of embedding model.
196
+ 3. **Advanced RAG Techniques: ✨** Explore methods like re-ranking, query transformations, or HyDE. The goal is to improve context quality and relevance for the LLM.
197
+ 4. **LLM for Generation: 🧠** Experiment with different LLMs for answer generation. `evaluate_rag.ipynb` uses `gpt-4.1-nano` (RAG) and `gpt-4.1-mini` (evaluator); `app.py` uses `gpt-4.1-mini`. Consistency or a more powerful model might improve results.
198
+ 5. **Iterative Evaluation: 🔁** Keep using RAGAS on the golden dataset. This will meticulously track each change's impact.
199
 
200
  This concludes the update to `ANSWER.md` based on your instructions.