mbudisic committed
Commit 90e23a5 · Parent: 396bace

Created a TDS version of the post and a mermaid diagram

Files changed (2)
  1. BLOGPOST.md +45 -27
  2. BLOGPOST_TDS.md +218 -0
BLOGPOST.md CHANGED
@@ -38,13 +38,31 @@ For our PsTuts RAG project, which uses video class transcripts, this is very imp
38
  **How it Works (The Gist):**
39
  Semantic chunking often uses embedding models. These models turn text into number lists (vectors) that show its meaning. By comparing how much alike the vectors for nearby sentences are, the system can find points where the meaning changes a lot. This suggests a topic shift and a good spot for a chunk break.
40
 
41
- The [Langchain library](https://python.langchain.com/docs/get_started/introduction) is a great tool for building LLM apps. It offers tools like the `SemanticChunker`. As the Langchain documentation for the `SemanticChunker` notes, its approach is "Taken from Greg Kamradt's wonderful notebook... All credits to him." ([see docs](https://python.langchain.com/api_reference/experimental/text_splitter/langchain_experimental.text_splitter.SemanticChunker.html#semanticchunker)). To honor this, in our project, we'll call the output of this process a **"Kamradt Chunk."** This term will refer to a semantically coherent segment of text produced by this method.
42
 
43
  ---
44
 
45
  ### 🛠️ Our First Try: Semantic Chunking with Time Links
46
 
47
- Let's look at how we first did this for the PsTuts video transcripts. A main reason for this first version was the need to link these Kamradt Chunks back to their exact times in the original videos. This helps users who want to jump right to that moment in a class.
48
 
49
  You can see the code in our GitHub project: [`mbudisic/pstuts-rag`](https://github.com/mbudisic/pstuts-rag/blob/main/).
50
 
@@ -83,9 +101,9 @@ We use two main loaders, found in `pstuts_rag/pstuts_rag/loader.py` ([view on Gi
83
  # )
84
  ```
85
 
86
- This way, we have two views of our data: one with tiny, timed sentences, and one with full texts ready for smart cutting into Kamradt Chunks.
87
 
88
- **Step 2: Semantic Cutting (Making Kamradt Chunks) 🧠🔪**
89
 
90
  This part happens mostly in the `chunk_transcripts` function in `pstuts_rag/pstuts_rag/datastore.py` ([view on GitHub](https://github.com/mbudisic/pstuts-rag/blob/main/pstuts_rag/pstuts_rag/datastore.py)).
91
 
@@ -102,21 +120,21 @@ We give the `docs_full_transcript` (from `VideoTranscriptBulkLoader`) to Langcha
102
  # ]
103
  # )
104
  # # Make a flat list of documents
105
- # docs_chunks_semantic: List[Document] = [] # These will be our Kamradt Chunks
106
  # for group in docs_group:
107
  # docs_chunks_semantic.extend(group)
108
  # ...
109
  ```
110
 
111
- The `SemanticChunker` smartly cuts the long transcript from each video into smaller, related Kamradt Chunks. Each of these `docs_chunks_semantic` (our Kamradt Chunks) now aims to be a clear idea or step from the class.
112
 
113
- Now, our Kamradt Chunks make sense text-wise. But they miss a key thing for videos: **timing**. A user asking "How do I use the clone stamp tool?" wants the text answer. But they also want to know *where* in the video that part is.
114
 
115
- **Step 3: Linking Kamradt Chunks to Times 🕰️🔗**
116
 
117
- This is a core part of our first plan: linking the Kamradt Chunks to the original, timed sentences. We need to find which of our first, small sentences (from `VideoTranscriptChunkLoader`) make up each new Kamradt Chunk.
118
 
119
- Still in `chunk_transcripts` (`datastore.py`), we go through each Kamradt Chunk (`docs_chunks_semantic`). For each one, we look at our list of original, timed sentences (`docs_chunks_verbatim`):
120
 
121
  ```python
122
  # Part of chunk_transcripts in datastore.py
@@ -129,7 +147,7 @@ Still in `chunk_transcripts` (`datastore.py`), we go through each Kamradt Chunk
129
  # video_id_to_chunks[video_id] = []
130
  # video_id_to_chunks[video_id].append(chunk_v)
131
 
132
- # for chunk_s in docs_chunks_semantic: # Our new Kamradt Chunk
133
  # video_id = chunk_s.metadata["video_id"]
134
  # # Only check verbatim chunks from the same video
135
  # potential_subchunks = video_id_to_chunks.get(video_id, [])
@@ -141,9 +159,9 @@ Still in `chunk_transcripts` (`datastore.py`), we go through each Kamradt Chunk
141
  # ...
142
  ```
143
 
144
- The line `if c.page_content in chunk_s.page_content` is key. It works on the idea that the text of an original, timed sentence will be inside the text of the bigger Kamradt Chunk it is part of. This direct search works because the Kamradt Chunk is usually made of several original sentences put together.
145
 
146
- After finding all original sentences (`subchunks`) that make up a Kamradt Chunk, we get their times:
147
 
148
  ```python
149
  # Part of chunk_transcripts in datastore.py
@@ -156,30 +174,30 @@ After finding all original sentences (`subchunks`) that make up a Kamradt Chunk,
156
 
157
  # if times: # Check if times list is not empty
158
  # chunk_s.metadata["start"], chunk_s.metadata["stop"] = (
159
- # times[0][0], # Start time of the first sentence in the Kamradt Chunk
160
- # times[-1][-1], # End time of the last sentence in the Kamradt Chunk
161
  # )
162
  # else:
163
  # chunk_s.metadata["start"], chunk_s.metadata["stop"] = None, None
164
  # ...
165
  ```
166
 
167
- So, each Kamradt Chunk now has more info:
168
  * `speech_start_stop_times`: A list of (start, end) times for every original sentence in it.
169
- * `start`: The start time of the very first sentence in the Kamradt Chunk.
170
- * `stop`: The end time of the very last sentence in the Kamradt Chunk.
171
 
172
- This extra info is very useful. When our RAG system finds a Kamradt Chunk, it gets not just clear text, but also exact timing. This lets the user go right to that part in the source video.
173
 
174
  **Step 4: Storing for Search in a Vector Database 💾**
175
 
176
- Once our Kamradt Chunks are made and timed, we make vectors from them (again, using `OpenAIEmbeddings`). Then we store them in our vector database, Qdrant. The `DatastoreManager` class handles this, also in `pstuts_rag/pstuts_rag/datastore.py`.
177
 
178
  ```python
179
  # Part of DatastoreManager in datastore.py
180
  # ...
181
  # async def populate_database(self, raw_docs: List[Dict[str, Any]]) -> int:
182
- # # Make Kamradt Chunks (with time links)
183
  # self.docs: List[Document] = await chunk_transcripts(
184
  # json_transcripts=raw_docs,
185
  # semantic_chunker_embedding_model=self.embeddings,
@@ -188,7 +206,7 @@ Once our Kamradt Chunks are made and timed, we make vectors from them (again, us
188
  # ...
189
  ```
190
 
191
- The `DatastoreManager` creates vector embeddings for these rich Kamradt Chunks. It then puts them into Qdrant, so the RAG system can search them fast.
192
 
193
  ---
194
 
@@ -196,9 +214,9 @@ The `DatastoreManager` creates vector embeddings for these rich Kamradt Chunks.
196
 
197
  Why use this specific way for our first try?
198
 
199
- 1. **More To-the-Point Answers:** Kamradt Chunks aim to give LLMs more complete and clear info. This can lead to more on-target answers than simpler chunking ways.
200
  2. **Better User Navigation:** For videos, linking chunks to times is a big plus for users. They can go right to the point in a class where the info is, saving time.
201
- 3. **Good Use of Context Space:** LLMs have a limited context space. Clear Kamradt Chunks help use this space well by giving real info, not broken bits of text.
202
  4. **Base for Fewer Errors:** When LLMs get better, focused context, they may make fewer mistakes or unsupported claims. The found chunks act as stronger guides.
203
  5. **Works for Complex Content:** As source texts (or videos) get longer and more complex, the pluses of a semantic way to chunk usually show more.
204
 
@@ -212,15 +230,15 @@ While this first way has pluses, there are things to think about for future work
212
 
213
  * **Ongoing Refinement and Evaluation:** This is our first version of the chunking strategy. Key next steps involve rigorously evaluating its performance with metrics (like RAGAS). We will also continue to explore and fine-tune aspects like embedding model selection, `SemanticChunker` parameters, and adapting our timestamp association logic, especially if we explore different underlying chunking mechanisms. Current computational costs and assumptions (like direct text matching for timestamping) will also be monitored and optimized as part of this iterative process.
214
  * **Advanced Contextualization Techniques:** We plan to investigate sophisticated methods for context generation and retrieval from other research, such as those proposed by Anthropic. This could offer alternative ways to define and deliver the most relevant information to the LLM.
215
- * **Enhanced Relatedness Measures:** Beyond semantic similarity, we aim to develop and incorporate additional signals for chunk relatedness. For sequential content like tutorials, using the temporal distance between Kamradt Chunks based on their timestamps could provide a valuable complementary metric.
216
 
217
  ---
218
 
219
  ### 🎬 Wrap-up: A Solid First Step in Chunking for RAG
220
 
221
- In essence, thoughtful data preparation, particularly through semantic chunking that preserves vital metadata like timestamps, is crucial for effective RAG systems. While simple methods offer speed, our approach in the PsTuts RAG project—creating timed 'Kamradt Chunks'—demonstrates a practical first step towards richer context and better user experience, especially for complex content like video tutorials. We invite you to explore the [PsTuts RAG project on GitHub](https://github.com/mbudisic/pstuts-rag/blob/main/) and share your own chunking insights in the comments below as we continue to refine this process.
222
 
223
  #RAG #AI #LLM #SemanticChunking #KamradtChunk #VectorDatabase #Qdrant #Langchain #Python #Developer #DataScience #MachineLearning #PsTutsRAG
224
 
225
  ---
226
- *Note: The term "Kamradt Chunk" is used in this project to refer to the output of the `SemanticChunker` (from Langchain Experimental), named in acknowledgment of Greg Kamradt's foundational work in this area, as cited in the `SemanticChunker` documentation.*
 
38
  **How it Works (The Gist):**
39
  Semantic chunking often uses embedding models. These models turn text into number lists (vectors) that show its meaning. By comparing how much alike the vectors for nearby sentences are, the system can find points where the meaning changes a lot. This suggests a topic shift and a good spot for a chunk break.
40
 
41
+ The [Langchain library](https://python.langchain.com/docs/get_started/introduction) is a great tool for building LLM apps. It offers tools like the `SemanticChunker`. As the Langchain documentation for the `SemanticChunker` notes, its approach is "Taken from [Greg Kamradt's wonderful notebook](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb)... All credits to him." ([see docs](https://python.langchain.com/api_reference/experimental/text_splitter/langchain_experimental.text_splitter.SemanticChunker.html#semanticchunker)). To honor this, in our project, we'll call the output of this process a **"Kamradt chunk."** This term will refer to a semantically coherent segment of text produced by this method.
42
 
43
  ---
44
 
45
  ### 🛠️ Our First Try: Semantic Chunking with Time Links
46
 
47
+ Let's look at how we first did this for the PsTuts video transcripts. A main reason for this first version was the need to link these Kamradt chunks back to their exact times in the original videos. This helps users who want to jump right to that moment in a class.
48
+
49
+ The following diagram illustrates the overall data flow from raw transcripts to their storage as queryable, timestamped Kamradt chunks in the vector database:
50
+
51
+ ```mermaid
52
+ graph TD
53
+ A[Video transcript: phrases+timestamps] -->|JSON phrase + time | B1(Phrases as documents: VideoTranscriptChunkLoader);
54
+ A --> |JSON phrase + time | B2[Merge phrases: VideoTranscriptBulkLoader ];
55
+
56
+ B1 --> |Phrase+time| F_Assoc[ Search for phrase in Kamradt chunks ];
57
+ B2 --> | Full Transcript | D[SemanticChunker];
58
+
59
+ D --> | Kamradt chunks: Multi-phrase | F_Assoc;
60
+ F_Assoc --> |Kamradt chunks + time| H[Embedding model];
61
+
62
+ H --> |Vectorized Kamradt chunks + time| J[(Qdrant Vector Database)];
63
+ ```
64
+
65
+
66
 
67
  You can see the code in our GitHub project: [`mbudisic/pstuts-rag`](https://github.com/mbudisic/pstuts-rag/blob/main/).
68
 
 
101
  # )
102
  ```
103
 
104
+ This way, we have two views of our data: one with tiny, timed sentences, and one with full texts ready for smart cutting into Kamradt chunks.
105
 
106
+ **Step 2: Semantic Cutting (Making Kamradt chunks) 🧠🔪**
107
 
108
  This part happens mostly in the `chunk_transcripts` function in `pstuts_rag/pstuts_rag/datastore.py` ([view on GitHub](https://github.com/mbudisic/pstuts-rag/blob/main/pstuts_rag/pstuts_rag/datastore.py)).
109
 
 
120
  # ]
121
  # )
122
  # # Make a flat list of documents
123
+ # docs_chunks_semantic: List[Document] = [] # These will be our Kamradt chunks
124
  # for group in docs_group:
125
  # docs_chunks_semantic.extend(group)
126
  # ...
127
  ```
128
 
129
+ The `SemanticChunker` smartly cuts the long transcript from each video into smaller, related Kamradt chunks. Each of these `docs_chunks_semantic` (our Kamradt chunks) now aims to be a clear idea or step from the class.
130
 
131
+ Now, our Kamradt chunks make sense text-wise. But they miss a key thing for videos: **timing**. A user asking "How do I use the clone stamp tool?" wants the text answer. But they also want to know *where* in the video that part is.
132
 
133
+ **Step 3: Linking Kamradt chunks to Times 🕰️🔗**
134
 
135
+ This is a core part of our first plan: linking the Kamradt chunks to the original, timed sentences. We need to find which of our first, small sentences (from `VideoTranscriptChunkLoader`) make up each new Kamradt chunk.
136
 
137
+ Still in `chunk_transcripts` (`datastore.py`), we go through each Kamradt chunk (`docs_chunks_semantic`). For each one, we look at our list of original, timed sentences (`docs_chunks_verbatim`):
138
 
139
  ```python
140
  # Part of chunk_transcripts in datastore.py
 
147
  # video_id_to_chunks[video_id] = []
148
  # video_id_to_chunks[video_id].append(chunk_v)
149
 
150
+ # for chunk_s in docs_chunks_semantic: # Our new Kamradt chunk
151
  # video_id = chunk_s.metadata["video_id"]
152
  # # Only check verbatim chunks from the same video
153
  # potential_subchunks = video_id_to_chunks.get(video_id, [])
 
159
  # ...
160
  ```
161
 
162
+ The line `if c.page_content in chunk_s.page_content` is key. It works on the idea that the text of an original, timed sentence will be inside the text of the bigger Kamradt chunk it is part of. This direct search works because the Kamradt chunk is usually made of several original sentences put together.
163
 
164
+ After finding all original sentences (`subchunks`) that make up a Kamradt chunk, we get their times:
165
 
166
  ```python
167
  # Part of chunk_transcripts in datastore.py
 
174
 
175
  # if times: # Check if times list is not empty
176
  # chunk_s.metadata["start"], chunk_s.metadata["stop"] = (
177
+ # times[0][0], # Start time of the first sentence in the Kamradt chunk
178
+ # times[-1][-1], # End time of the last sentence in the Kamradt chunk
179
  # )
180
  # else:
181
  # chunk_s.metadata["start"], chunk_s.metadata["stop"] = None, None
182
  # ...
183
  ```
184
 
185
+ So, each Kamradt chunk now has more info:
186
  * `speech_start_stop_times`: A list of (start, end) times for every original sentence in it.
187
+ * `start`: The start time of the very first sentence in the Kamradt chunk.
188
+ * `stop`: The end time of the very last sentence in the Kamradt chunk.
189
 
190
+ This extra info is very useful. When our RAG system finds a Kamradt chunk, it gets not just clear text, but also exact timing. This lets the user go right to that part in the source video.
191
 
192
  **Step 4: Storing for Search in a Vector Database 💾**
193
 
194
+ Once our Kamradt chunks are made and timed, we make vectors from them (again, using `OpenAIEmbeddings`). Then we store them in our vector database, Qdrant. The `DatastoreManager` class handles this, also in `pstuts_rag/pstuts_rag/datastore.py`.
195
 
196
  ```python
197
  # Part of DatastoreManager in datastore.py
198
  # ...
199
  # async def populate_database(self, raw_docs: List[Dict[str, Any]]) -> int:
200
+ # # Make Kamradt chunks (with time links)
201
  # self.docs: List[Document] = await chunk_transcripts(
202
  # json_transcripts=raw_docs,
203
  # semantic_chunker_embedding_model=self.embeddings,
 
206
  # ...
207
  ```
208
 
209
+ The `DatastoreManager` creates vector embeddings for these rich Kamradt chunks. It then puts them into Qdrant, so the RAG system can search them fast.
210
 
211
  ---
212
 
 
214
 
215
  Why use this specific way for our first try?
216
 
217
+ 1. **More To-the-Point Answers:** Kamradt chunks aim to give LLMs more complete and clear info. This can lead to more on-target answers than simpler chunking ways.
218
  2. **Better User Navigation:** For videos, linking chunks to times is a big plus for users. They can go right to the point in a class where the info is, saving time.
219
+ 3. **Good Use of Context Space:** LLMs have a limited context space. Clear Kamradt chunks help use this space well by giving real info, not broken bits of text.
220
  4. **Base for Fewer Errors:** When LLMs get better, focused context, they may make fewer mistakes or unsupported claims. The found chunks act as stronger guides.
221
  5. **Works for Complex Content:** As source texts (or videos) get longer and more complex, the pluses of a semantic way to chunk usually show more.
222
 
 
230
 
231
  * **Ongoing Refinement and Evaluation:** This is our first version of the chunking strategy. Key next steps involve rigorously evaluating its performance with metrics (like RAGAS). We will also continue to explore and fine-tune aspects like embedding model selection, `SemanticChunker` parameters, and adapting our timestamp association logic, especially if we explore different underlying chunking mechanisms. Current computational costs and assumptions (like direct text matching for timestamping) will also be monitored and optimized as part of this iterative process.
232
  * **Advanced Contextualization Techniques:** We plan to investigate sophisticated methods for context generation and retrieval from other research, such as those proposed by Anthropic. This could offer alternative ways to define and deliver the most relevant information to the LLM.
233
+ * **Enhanced Relatedness Measures:** Beyond semantic similarity, we aim to develop and incorporate additional signals for chunk relatedness. For sequential content like tutorials, using the temporal distance between Kamradt chunks based on their timestamps could provide a valuable complementary metric.
234
 
235
  ---
236
 
237
  ### 🎬 Wrap-up: A Solid First Step in Chunking for RAG
238
 
239
+ In essence, thoughtful data preparation, particularly through semantic chunking that preserves vital metadata like timestamps, is crucial for effective RAG systems. While simple methods offer speed, our approach in the PsTuts RAG project—creating timed 'Kamradt chunks'—demonstrates a practical first step towards richer context and better user experience, especially for complex content like video tutorials. We invite you to explore the [PsTuts RAG project on GitHub](https://github.com/mbudisic/pstuts-rag/blob/main/) and share your own chunking insights in the comments below as we continue to refine this process.
240
 
241
  #RAG #AI #LLM #SemanticChunking #KamradtChunk #VectorDatabase #Qdrant #Langchain #Python #Developer #DataScience #MachineLearning #PsTutsRAG
242
 
243
  ---
244
+ *Note: The term "Kamradt chunk" is used in this project to refer to the output of the `SemanticChunker` (from Langchain Experimental), named in acknowledgment of Greg Kamradt's foundational work in this area, as cited in the `SemanticChunker` documentation.*
BLOGPOST_TDS.md ADDED
@@ -0,0 +1,218 @@
1
+ # Enhancing RAG: A Practical Chunking Strategy for Video Transcripts with Timestamp Alignment
2
+
3
+ A detailed walkthrough of an initial approach to semantic chunking that links verbose tutorial content to precise video timestamps, improving context for Retrieval Augmented Generation.
4
+
5
+ ---
6
+
7
+ Retrieval Augmented Generation (RAG) systems are powerful, but their performance heavily relies on the quality of context provided to the Large Language Model (LLM). When dealing with extensive content like video tutorial transcripts, naive chunking can lead to fragmented, irrelevant, or incomplete information, ultimately degrading the user's experience. This article presents a practical, first-iteration strategy implemented in our [PsTuts RAG project](https://github.com/mbudisic/pstuts-rag/blob/main/). We'll detail how we combine semantic chunking with meticulous timestamp alignment to tackle these challenges, offering a method to create contextually rich and accurately timed chunks from video data. This approach not only aims to improve LLM responses but also allows users to directly navigate to the relevant segments within the source videos.
8
+
9
+ ---
10
+
11
+ ## The Challenge in Chunking Video Transcripts
12
+
13
+ Processing lengthy instructional video content for RAG presents a common dilemma: how do you break it down? Fixed-size segments often sever ideas mid-thought, while basic sentence or paragraph splits may not capture complete semantic units. LLMs, much like human learners, benefit from coherent, self-contained blocks of information.
14
+
15
+ **Limitations of Standard Chunking Methods:**
16
+
17
+ * **Fixed-Size Chunks:** This method, while simple, often disregards natural semantic breaks, leading to fragmented context and potentially incomplete information within chunks.
18
+ * **Sentence Splits:** While an improvement, individual sentences in detailed tutorials may not encapsulate a full concept. Conversely, multiple sentences can form a single, indivisible instructional step.
19
+ * **Paragraph Splits:** Paragraphs can group related ideas, but complex topics might span several paragraphs, or a single paragraph could cover multiple distinct points.
20
+
21
+ The core issue is that these structural methods don't inherently understand the *meaning* of the text. For RAG, this can result in retrieved chunks that provide only partial context or a mix of relevant and irrelevant details, thereby hampering the LLM's ability to generate optimal responses.
22
+
23
+ ---
24
+
25
+ ## Leveraging Semantic Chunking for Coherent Context
26
+
27
+ Semantic chunking offers a more intelligent approach by dividing text based on its underlying meaning. It identifies natural breakpoints where topics shift or ideas conclude, aiming to create chunks that are internally consistent and contextually rich. For video tutorials, where a single topic or step can span multiple sentences or vary in length, this is particularly valuable.
28
+
29
+ **The Core Idea:**
30
+ Semantic chunking often utilizes embedding models to convert text into numerical vectors that capture its meaning. By analyzing the similarity between vectors of adjacent text segments (like sentences or groups of sentences), the system can identify points where semantic similarity drops, indicating a suitable boundary for a chunk.
31
+
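+ To make this concrete, here is a small illustrative sketch. It is not the `SemanticChunker` internals (those add percentile-based thresholds and sentence buffering), just the basic idea of splitting where similarity between adjacent sentence embeddings drops; `embed` stands in for any embedding model:
+
+ ```python
+ # Illustrative sketch only: split where adjacent-sentence similarity drops.
+ # `embed(texts)` -> list of vectors is a stand-in for any embedding model.
+ import numpy as np
+
+ def breakpoints(sentences: list[str], embed, threshold: float = 0.75) -> list[int]:
+     vectors = np.asarray(embed(sentences), dtype=float)
+     # Normalize rows so neighboring dot products are cosine similarities
+     vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
+     sims = (vectors[:-1] * vectors[1:]).sum(axis=1)
+     # Low similarity between sentence i and i+1 suggests a topic shift
+     return [i + 1 for i, s in enumerate(sims) if s < threshold]
+ ```
+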
32
+ The [Langchain library](https://python.langchain.com/docs/get_started/introduction)
33
+ offers a `SemanticChunker` tool, inspired by
34
+ [Greg Kamradt's work](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb)
35
+ ([see docs](https://python.langchain.com/api_reference/experimental/text_splitter/langchain_experimental.text_splitter.SemanticChunker.html#semanticchunker)).
36
+ In recognition of this, we refer to the semantically coherent text segments
37
+ produced by this process as **"Kamradt chunks"**.
38
+
39
+
40
+ ---
41
+
42
+ ## Our Implementation: Semantic Chunking with Precise Timestamping
43
+
44
+ Our initial chunking strategy for the PsTuts video transcripts was driven by a critical requirement: linking the semantically coherent Kamradt chunks back to their precise start and end times in the original videos. This allows users to not only read the retrieved information but also jump directly to the corresponding moment in the tutorial.
45
+
46
+ The following diagram illustrates the overall data flow from raw transcripts to their storage as queryable, timestamped Kamradt chunks in the vector database:
47
+
50
+ ```mermaid
51
+ graph TD
52
+ A[Video transcript: phrases+timestamps] -->|JSON phrase + time | B1(Phrases as documents: VideoTranscriptChunkLoader);
53
+ A --> |JSON phrase + time | B2[Merge phrases: VideoTranscriptBulkLoader ];
54
+
55
+ B1 --> |Phrase+time| F_Assoc[ Search for phrase in Kamradt chunks ];
56
+ B2 --> | Full Transcript | D[SemanticChunker];
57
+
58
+ D --> | Kamradt chunks: Multi-phrase | F_Assoc;
59
+ F_Assoc --> |Kamradt chunks + time| H[Embedding model];
60
+
61
+ H --> |Vectorized Kamradt chunks + time| J[(Qdrant Vector Database)];
62
+ ```
63
+
64
+ The code implementing this approach can be found in our GitHub repository: [`mbudisic/pstuts-rag`](https://github.com/mbudisic/pstuts-rag/blob/main/).
65
+
66
+ **Step 1: Dual-Perspective Data Loading**
67
+
68
+ We begin by loading the video transcript data, initially in JSON format where each sentence is paired with start and end timestamps. Two distinct loaders from `pstuts_rag/pstuts_rag/loader.py` ([view on GitHub](https://github.com/mbudisic/pstuts-rag/blob/main/pstuts_rag/pstuts_rag/loader.py)) provide the necessary views:
69
+
70
+ 1. `VideoTranscriptChunkLoader`: This loader creates a `Document` for each individual sentence, meticulously preserving its `time_start` and `time_end` metadata. These fine-grained, timestamped sentences are crucial for the later alignment process.
71
+
72
+ ```python
73
+ # Excerpt from VideoTranscriptChunkLoader in loader.py
+ # ...
+ for transcript in transcripts:
+     yield Document(
+         page_content=transcript["sent"],
+         metadata=metadata
+         | {
+             "time_start": transcript["begin"],
+             "time_end": transcript["end"],
+         },
+     )
84
+ ```
85
+
86
+ 2. `VideoTranscriptBulkLoader`: This loader concatenates all sentences from a single video into one comprehensive `Document`. This complete transcript provides the broader context necessary for the `SemanticChunker` to effectively identify meaningful breakpoints.
87
+
88
+ ```python
89
+ # Excerpt from VideoTranscriptBulkLoader in loader.py
+ # ...
+ yield Document(
+     page_content="\n".join(
+         t["sent"] for t in video["transcripts"]
+     ),
+     metadata=metadata,
+ )
97
+ ```
98
+ This dual-loading strategy yields both a granular, timestamped sentence view and a holistic textual view essential for creating meaningful Kamradt chunks.
99
+
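+ As a usage sketch, the two views might be produced as below. The module path, constructor argument, and `lazy_load` call are assumptions for illustration; see `loader.py` in the repository for the real interface:
+
+ ```python
+ # Sketch under assumptions: `videos` is the parsed transcript JSON, and the
+ # loaders follow Langchain's BaseLoader convention of yielding Documents.
+ from pstuts_rag.loader import (
+     VideoTranscriptBulkLoader,
+     VideoTranscriptChunkLoader,
+ )
+
+ docs_chunks_verbatim = list(VideoTranscriptChunkLoader(videos).lazy_load())
+ docs_full_transcript = list(VideoTranscriptBulkLoader(videos).lazy_load())
+ ```
+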
100
+ **Step 2: Semantic Segmentation into Kamradt chunks**
101
+
102
+ The core semantic division occurs in the `chunk_transcripts` function within `pstuts_rag/pstuts_rag/datastore.py` ([view on GitHub](https://github.com/mbudisic/pstuts-rag/blob/main/pstuts_rag/pstuts_rag/datastore.py)). The full transcripts (from `VideoTranscriptBulkLoader`) are processed by Langchain's `SemanticChunker`, powered by `OpenAIEmbeddings` (`text-embedding-3-small` model).
103
+
104
+ ```python
105
+ # Excerpt from chunk_transcripts in datastore.py
+ # ...
+ text_splitter = SemanticChunker(semantic_chunker_embedding_model)
+ docs_group = await asyncio.gather(
+     *[
+         text_splitter.atransform_documents(d)
+         for d in batch(docs_full_transcript, size=2)  # batching for efficiency
+     ]
+ )
+ # Flatten the list of document groups into Kamradt chunks
+ docs_chunks_semantic: List[Document] = []
+ for group in docs_group:
+     docs_chunks_semantic.extend(group)
+ # ...
119
+ ```
120
+ The `SemanticChunker` divides each long transcript into smaller, semantically related Kamradt chunks, each representing a distinct idea or instructional step. While these chunks are now textually coherent, they lack the crucial timing information needed for video navigation.
121
+
122
+ **Step 3: Linking Kamradt chunks to Source Timestamps**
123
+
124
+ This vital step bridges our Kamradt chunks with the original sentence-level timestamps. We iterate through each generated Kamradt chunk and identify the original, timestamped sentences (from `VideoTranscriptChunkLoader`) that comprise it. This is achieved by checking if the text of an original sentence is contained within the text of the Kamradt chunk.
125
+
126
+ ```python
127
+ # Excerpt from chunk_transcripts in datastore.py
+ # ...
+ # Create a lookup for faster access to original timestamped sentences
+ video_id_to_chunks: Dict[int, List[Document]] = {}
+ for verbatim_chunk in docs_chunks_verbatim:
+     video_id: int = verbatim_chunk.metadata["video_id"]
+     if video_id not in video_id_to_chunks:
+         video_id_to_chunks[video_id] = []
+     video_id_to_chunks[video_id].append(verbatim_chunk)
+
+ for semantic_chunk in docs_chunks_semantic:  # each of our Kamradt chunks
+     video_id = semantic_chunk.metadata["video_id"]
+     # Only consider verbatim sentences from the same video
+     potential_subchunks = video_id_to_chunks.get(video_id, [])
+     subchunks = [
+         original_sentence
+         for original_sentence in potential_subchunks
+         if original_sentence.page_content in semantic_chunk.page_content
+     ]
145
+ ```
146
+ Once all constituent original sentences (`subchunks`) are collected for a Kamradt chunk, their timestamps are aggregated:
147
+
148
+ ```python
149
+ # Excerpt from chunk_transcripts in datastore.py
+ # ... (still inside the loop over semantic_chunk)
+     times = [
+         (t.metadata["time_start"], t.metadata["time_end"])
+         for t in subchunks
+     ]
+     # Store all individual sentence times
+     semantic_chunk.metadata["speech_start_stop_times"] = times
+
+     if times:
+         semantic_chunk.metadata["start"] = times[0][0]  # start of first sentence
+         semantic_chunk.metadata["stop"] = times[-1][-1]  # end of last sentence
+     else:
+         semantic_chunk.metadata["start"], semantic_chunk.metadata["stop"] = None, None
+ # ...
163
+ ```
164
+
165
+ Each Kamradt chunk is thereby enriched with:
166
+ * `speech_start_stop_times`: A list of (start, end) tuples for every original sentence it contains.
167
+ * `start`: The precise start time of the semantic passage in the video.
168
+ * `stop`: The precise end time of the semantic passage in the video.
169
+
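+ For example, the metadata on one enriched chunk might look like this (all values are illustrative):
+
+ ```python
+ # Illustrative metadata for a single Kamradt chunk; the values are made up.
+ {
+     "video_id": 42,
+     "speech_start_stop_times": [(12.4, 15.9), (16.1, 21.0), (21.2, 27.5)],
+     "start": 12.4,  # first constituent sentence begins here
+     "stop": 27.5,   # last constituent sentence ends here
+ }
+ ```
+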
170
+ **Step 4: Vectorization and Storage for Retrieval**
171
+
172
+ Finally, these timestamp-enriched Kamradt chunks are vectorized using `OpenAIEmbeddings` and stored in a Qdrant vector database. This process, managed by the `DatastoreManager` (also in `pstuts_rag/pstuts_rag/datastore.py`), makes the chunks efficiently searchable for the RAG system.
173
+
174
+ ```python
175
+ # Excerpt from DatastoreManager in datastore.py
+ # ...
+ async def populate_database(self, raw_docs: List[Dict[str, Any]]) -> int:
+     # Make Kamradt chunks (with time links); embedding and upload follow
+     self.docs: List[Document] = await chunk_transcripts(
+         json_transcripts=raw_docs,
+         semantic_chunker_embedding_model=self.embeddings,
+     )
+     # ...
183
+ ```
184
+
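+ As a retrieval sketch: once the collection is populated, a similarity search returns both the chunk text and the timing metadata attached in Step 3. The collection name and URL below are placeholders, and `DatastoreManager` wires this up differently:
+
+ ```python
+ # Sketch of querying the populated store; collection name and URL are
+ # placeholders, not the project's configuration.
+ from langchain_openai import OpenAIEmbeddings
+ from langchain_qdrant import QdrantVectorStore
+
+ store = QdrantVectorStore.from_existing_collection(
+     collection_name="pstuts_chunks",  # placeholder name
+     url="http://localhost:6333",      # placeholder Qdrant endpoint
+     embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
+ )
+ hits = store.similarity_search("How do I use the clone stamp tool?", k=3)
+ for doc in hits:
+     print(doc.metadata["start"], doc.metadata["stop"], doc.page_content[:80])
+ ```
+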
185
+ ---
186
+
187
+ ## Advantages of This Timestamp-Aware Chunking Method
188
+
189
+ This initial approach to chunking video transcripts offers several benefits for RAG systems:
190
+
191
+ 1. **Enhanced Relevance:** Semantically coherent Kamradt chunks provide LLMs with more complete and contextually sound information, potentially leading to more accurate and relevant responses.
192
+ 2. **Improved User Experience:** Linking chunks to precise video timestamps allows users to instantly navigate to the relevant segment of a tutorial, significantly improving usability.
193
+ 3. **Efficient Context Window Utilization:** Meaningful chunks make better use of the LLM's limited context window compared to fragmented text.
194
+ 4. **Foundation for Reduced Errors:** Better, more focused context can help ground the LLM, potentially reducing the likelihood of factual errors or hallucinations.
195
+ 5. **Scalability for Complex Content:** This semantic approach is generally more robust for handling long and intricate instructional content than simpler structural methods.
196
+
197
+ This strategy means that the PsTuts RAG system can provide answers that are not only textually derived from the tutorial content but are also directly linked to the specific explanatory segments in the videos.
198
+
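+ For instance, the `start` metadata is enough to build a deep link into the player. A sketch, with a made-up `?t=<seconds>` URL scheme (substitute whatever the actual player expects):
+
+ ```python
+ # Sketch: turn a chunk's start time into a deep link.
+ # The "?t=<seconds>" pattern is hypothetical, not the PsTuts player's scheme.
+ def deep_link(video_url: str, start_seconds: float) -> str:
+     return f"{video_url}?t={int(start_seconds)}"
+
+ deep_link("https://example.com/videos/42", 12.4)
+ # -> 'https://example.com/videos/42?t=12'
+ ```
+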
199
+ ---
200
+
201
+ ## Current Considerations and Future Directions
202
+
203
+ While this methodology offers a solid foundation, further development will focus on:
204
+
205
+ * **Ongoing Refinement and Evaluation:** This is our first iteration. Key next steps involve rigorously evaluating performance with metrics (like RAGAS). We will continue to explore and fine-tune aspects like embedding model selection, `SemanticChunker` parameters, and adapting our timestamp association logic, especially if we explore different underlying chunking mechanisms. Current computational costs and the assumption of direct text matching for timestamping will also be monitored and optimized.
206
+ * **Advanced Contextualization Techniques:** We plan to investigate sophisticated methods for context generation and retrieval from other research, such as those proposed by Anthropic. This could offer alternative ways to define and deliver the most relevant information to the LLM.
207
+ * **Enhanced Relatedness Measures:** Beyond semantic similarity, we aim to develop and incorporate additional signals for chunk relatedness. For sequential content like tutorials, using the temporal distance between Kamradt chunks based on their timestamps could provide a valuable complementary metric (see the sketch after this list).
208
+
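+ A minimal sketch of that temporal signal, using the `start`/`stop` metadata attached in Step 3:
+
+ ```python
+ # Sketch: temporal gap in seconds between two Kamradt chunks of one video,
+ # built on the start/stop metadata from Step 3; 0.0 means they touch or overlap.
+ def temporal_gap(chunk_a, chunk_b) -> float | None:
+     if chunk_a.metadata["video_id"] != chunk_b.metadata["video_id"]:
+         return None  # only meaningful within a single video
+     first, second = sorted(
+         (chunk_a, chunk_b), key=lambda c: c.metadata["start"]
+     )
+     return max(0.0, second.metadata["start"] - first.metadata["stop"])
+ ```
+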
209
+ ---
210
+
211
+ ## Conclusion: A Practical Step Towards Better Video RAG
212
+
213
+ Thoughtful data preparation is paramount for effective RAG systems. For complex, sequential content like video tutorials, combining semantic chunking with meticulous preservation of metadata, such as timestamps, offers a significant improvement over simpler methods. Our initial approach in the PsTuts RAG project—creating precisely timed 'Kamradt chunks'—illustrates a practical step towards delivering richer context and an enhanced user experience. We encourage you to explore the [PsTuts RAG project on GitHub](https://github.com/mbudisic/pstuts-rag/blob/main/) and share your insights as we continue to refine this process.
214
+
215
+ #RAG #AI #LLM #SemanticChunking #KamradtChunk #VectorDatabase #Qdrant #Langchain #Python #Developer #DataScience #MachineLearning #PsTutsRAG
216
+
217
+ ---
218
+ *Note: The term "Kamradt chunk" is used in this project to refer to the output of the `SemanticChunker` (from Langchain Experimental), named in acknowledgment of Greg Kamradt's foundational work in this area, as cited in the `SemanticChunker` documentation.*