Data Pipeline

Source data and documentation for the vector index.

Files

| File | Description |
| --- | --- |
| `episodes_website.json` | Episode metadata (108 episodes): titles, guests, YouTube URLs |
| `episodes_embedding_chunks.csv` | 14,865 transcript chunks with timestamps |

What Gets Embedded

Each chunk is a structured text block that includes episode context and transcript content:

Episode: 455
Chapter: Introduction

BEFORE (context):
[~50 words from previous segment]

MAIN (target):
Adam Frank: If we don't ask how long they last, but instead ask 
what's the probability that there have been any civilizations at all...
[~300 words - the core content we're searching]

AFTER (context):
[~50 words from next segment]
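
As a rough illustration of the layout above, a chunk could be assembled like this. This is a minimal sketch, not the project's actual code: the function name `format_chunk` and its parameters are hypothetical, and the placeholder strings stand in for real transcript text.

```python
def format_chunk(episode, chapter, before, main, after):
    """Assemble one embedding chunk in the BEFORE/MAIN/AFTER layout.

    Hypothetical helper: the real pipeline may build this text differently.
    """
    return (
        f"Episode: {episode}\n"
        f"Chapter: {chapter}\n"
        "\n"
        "BEFORE (context):\n"
        f"{before}\n"
        "\n"
        "MAIN (target):\n"
        f"{main}\n"
        "\n"
        "AFTER (context):\n"
        f"{after}"
    )


# Placeholder inputs, only to show the resulting shape.
chunk_text = format_chunk(
    455,
    "Introduction",
    "...previous words...",
    "Adam Frank: If we don't ask how long they last...",
    "...next words...",
)
print(chunk_text)
```

Keeping the episode and chapter lines inside the embedded string is what lets a query such as "episode 455 introduction" match on those tokens as well as on the transcript itself.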

Key design choices:

Episode number and chapter title are embedded with the text — this helps retrieval when users ask about specific episodes or topics
The MAIN section contains the actual transcript with speaker names inline
BEFORE/AFTER provide context so the LLM understands the quote in its original setting
Overlapping windows ensure content isn't awkwardly split at chunk boundaries
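
The overlapping-window idea can be sketched as follows. This assumes fixed word counts (about 300 words for MAIN, about 50 for each context side); the actual pipeline may cut on transcript segment boundaries instead, and `split_with_context` is a hypothetical name.

```python
def split_with_context(words, main_size=300, context_size=50):
    """Split a word list into (before, main, after) windows.

    Each MAIN window is followed by the next one with no gap; BEFORE and
    AFTER borrow up to `context_size` words from the neighbouring windows.
    The sizes are assumptions based on the ~300 / ~50 word figures above.
    """
    windows = []
    for start in range(0, len(words), main_size):
        end = min(start + main_size, len(words))
        before = words[max(0, start - context_size):start]
        main = words[start:end]
        after = words[end:end + context_size]
        windows.append((before, main, after))
    return windows


# Toy transcript of 700 numbered "words".
words = [f"w{i}" for i in range(700)]
windows = split_with_context(words)
for before, main, after in windows:
    print(len(before), len(main), len(after))
```

The first window has no BEFORE text and the last has no AFTER text, since there is nothing to borrow from at the edges of a transcript.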

Vector Store

Embeddings: OpenAI text-embedding-3-small (1536 dimensions)

Index: FAISS flat index (IndexFlatL2)

Simple brute-force similarity search
Works well for our scale (~15K vectors)
No approximation — returns exact nearest neighbors
Trade-off: Larger indices (100K+) would benefit from IVF or HNSW for speed
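
To make "brute-force" concrete, here is a pure-Python sketch of what an exact flat L2 search computes. It is an illustration only and does not use the FAISS API; `flat_search` and the toy 2-dimensional vectors are made up for the example. Like `IndexFlatL2`, it ranks by squared Euclidean distance.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_search(vectors, query, k=3):
    """Compare the query against every stored vector and keep the k closest.

    Returns (distance, index) pairs, nearest first. Cost grows linearly
    with the number of stored vectors, which is why this approach suits
    small collections.
    """
    scored = ((squared_l2(vec, query), idx) for idx, vec in enumerate(vectors))
    return heapq.nsmallest(k, scored)


# Toy data: four 2-dimensional vectors instead of 1536-dimensional embeddings.
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
hits = flat_search(vectors, [0.9, 0.1], k=2)
print([idx for _, idx in hits])
```

Because every vector is scored, the result is exact; approximate indexes such as IVF or HNSW trade some of that exactness for speed by scoring only a subset.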

Storage: ../faiss_index.db/

index.faiss — vector data
index.pkl — metadata (episode, timestamp, URL, chapter)

Adding New Episodes

To add new episodes or rebuild the index:

1. Scrape new episode metadata and transcripts
2. Process transcripts into chunks using the same BEFORE/MAIN/AFTER format
3. Append to episodes_embedding_chunks.csv
4. Rebuild the FAISS index:
```
uv run python build_index.py
```
5. Check index status:
```
uv run python build_index.py --check
```
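
The append step could look roughly like the sketch below. The column layout of `episodes_embedding_chunks.csv` is not documented here, so the helper reads the existing header and reuses it instead of hard-coding names; `append_chunks` and the column names in the usage comment are hypothetical.

```python
import csv


def append_chunks(csv_path, new_rows):
    """Append dict rows to an existing CSV, reusing its header for column order.

    Assumes the file already has a header row and ends with a newline.
    Rows with keys missing from the header raise ValueError (DictWriter default).
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f))
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=header)
        writer.writerows(new_rows)
    return len(new_rows)


# Usage (column names are placeholders, not the real schema):
# append_chunks("episodes_embedding_chunks.csv",
#               [{"episode": "491", "chapter": "Introduction", "text": "..."}])
```

After appending, the index still has to be rebuilt with `build_index.py` so that the new rows are embedded.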

The index is rebuilt from scratch on every run; the build does not add vectors incrementally. For a production system with frequent updates, consider a vector database such as Pinecone or Weaviate that supports upserts.

Current data (as of Feb 2026): 108 episodes (#276–#491), newest: Peter Steinberger (OpenClaw)

Privacy

No user data is stored here. All content is from publicly available podcast episodes.