# Data Pipeline
Source data and documentation for the vector index.
## Files
| File | Description |
|------|-------------|
| `episodes_website.json` | Episode metadata (108 episodes): titles, guests, YouTube URLs |
| `episodes_embedding_chunks.csv` | 14,865 transcript chunks with timestamps |
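A quick way to sanity-check the two files is to count records in each. A minimal sketch (the field and column names below are assumptions about the schema, shown on inline samples rather than the real files):

```python
import csv
import io
import json

def summarize(episodes_json: str, chunks_csv: str) -> dict:
    """Count episodes and transcript chunks in the two source files.

    Field names ("episode", "title", ...) are assumed; match them
    to the actual schema in the repo.
    """
    episodes = json.loads(episodes_json)
    chunks = list(csv.DictReader(io.StringIO(chunks_csv)))
    return {"episodes": len(episodes), "chunks": len(chunks)}

# Tiny inline samples standing in for the real files
sample_json = '[{"episode": 455, "title": "Aliens", "guest": "Adam Frank"}]'
sample_csv = "episode,chapter,start_ts,text\n455,Introduction,00:00:12,Hello\n"
print(summarize(sample_json, sample_csv))  # {'episodes': 1, 'chunks': 1}
```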
## What Gets Embedded
Each chunk is a structured text block that includes episode context and transcript content:
```
Episode: 455
Chapter: Introduction
BEFORE (context):
[~50 words from previous segment]
MAIN (target):
Adam Frank: If we don't ask how long they last, but instead ask
what's the probability that there have been any civilizations at all...
[~300 words - the core content we're searching]
AFTER (context):
[~50 words from next segment]
```
**Key design choices:**
- Episode number and chapter title are embedded with the text, which helps retrieval when users ask about specific episodes or topics
- The MAIN section contains the actual transcript with speaker names inline
- BEFORE/AFTER provide context so the LLM understands the quote in its original setting
- Overlapping windows ensure content isn't awkwardly split at chunk boundaries
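The block above is straightforward to assemble with a small formatter. A sketch (the function name is illustrative, not the actual pipeline code; the ~50/~300-word window sizes come from the format description above):

```python
def format_chunk(episode: int, chapter: str,
                 before: str, main: str, after: str) -> str:
    """Assemble the text that gets embedded for one chunk.

    `before`/`after` are ~50-word context windows from the neighboring
    segments; `main` is the ~300-word target segment with speaker
    names left inline.
    """
    return (
        f"Episode: {episode}\n"
        f"Chapter: {chapter}\n"
        f"BEFORE (context):\n{before}\n"
        f"MAIN (target):\n{main}\n"
        f"AFTER (context):\n{after}"
    )

chunk = format_chunk(
    455, "Introduction",
    "...previous segment...",
    "Adam Frank: If we don't ask how long they last...",
    "...next segment...",
)
```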
## Vector Store

**Embeddings:** OpenAI `text-embedding-3-small` (1536 dimensions)

**Index:** FAISS flat index (`IndexFlatL2`)

- Simple brute-force similarity search
- Works well at our scale (~15K vectors)
- No approximation: returns exact nearest neighbors
- Trade-off: larger indices (100K+ vectors) would benefit from IVF or HNSW for speed

**Storage:** `../faiss_index.db/`

- `index.faiss`: vector data
- `index.pkl`: metadata (episode, timestamp, URL, chapter)
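The "no approximation" point is easy to see: a flat L2 index simply computes the distance from the query to every stored vector and takes the smallest. A NumPy sketch of the equivalent brute-force search, for illustration only (the real index is built with FAISS by `build_index.py`):

```python
import numpy as np

def flat_l2_search(vectors: np.ndarray, query: np.ndarray, k: int = 5):
    """Exact k-nearest-neighbor search by squared L2 distance:
    the same computation a FAISS IndexFlatL2 performs (no ANN)."""
    dists = ((vectors - query) ** 2).sum(axis=1)  # one distance per stored vector
    top = np.argsort(dists)[:k]                   # indices of the k closest
    return top, dists[top]

rng = np.random.default_rng(0)
vectors = rng.normal(size=(15_000, 1536)).astype("float32")  # ~our scale
query = vectors[42] + 0.01 * rng.normal(size=1536).astype("float32")
idx, _ = flat_l2_search(vectors, query)
print(idx[0])  # 42: the perturbed source vector is its own nearest neighbor
```

This is O(n) per query, which is why flat indexes stay practical at ~15K vectors but get replaced by IVF or HNSW at much larger scales.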
## Adding New Episodes
To add new episodes or rebuild the index:
1. Scrape new episode metadata and transcripts
2. Process transcripts into chunks using the same BEFORE/MAIN/AFTER format
3. Append to `episodes_embedding_chunks.csv`
4. Rebuild the FAISS index:

   ```bash
   uv run python build_index.py
   ```

5. Check index status:

   ```bash
   uv run python build_index.py --check
   ```
The build script rebuilds the index from scratch each time; there is no incremental update path. For a production system with frequent updates, consider a vector database like Pinecone or Weaviate that supports upserts.
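Step 3 above is a plain CSV append. A sketch, assuming chunk rows carry episode/chapter/timestamp/text columns (the real header lives in `episodes_embedding_chunks.csv`, and episode 492 is a hypothetical next episode):

```python
import csv
import tempfile
from pathlib import Path

FIELDS = ["episode", "chapter", "start_ts", "text"]  # assumed column names

def append_chunks(csv_path: Path, rows: list[dict]) -> None:
    """Append chunk rows; write the header only if the file is new."""
    is_new = not csv_path.exists()
    with csv_path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)

# Demo in a temp dir so nothing in the repo is touched
path = Path(tempfile.mkdtemp()) / "chunks_demo.csv"
append_chunks(path, [{"episode": 492, "chapter": "Intro",
                      "start_ts": "00:00:05",
                      "text": "Host: Welcome back to the show..."}])
```

After appending, the full rebuild in step 4 re-reads the CSV and regenerates the index.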
**Current data (as of Feb 2026):** 108 episodes (#276–#491), newest: Peter Steinberger (OpenClaw)
## Privacy
No user data is stored here. All content is from publicly available podcast episodes.