# Data Pipeline
Source data and documentation for the vector index.
## Files
| File | Description |
|------|-------------|
| `episodes_website.json` | Episode metadata (108 episodes): titles, guests, YouTube URLs |
| `episodes_embedding_chunks.csv` | 14,865 transcript chunks with timestamps |
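A quick way to sanity-check the two files is to count records in each. A minimal sketch (the field and column names below are assumptions about the schema, shown on inline samples rather than the real files):

```python
import csv
import io
import json

def summarize(episodes_json: str, chunks_csv: str) -> dict:
    """Count episodes and transcript chunks in the two source files.

    Field names ("episode", "title", ...) are assumed; match them
    to the actual schema in the repo.
    """
    episodes = json.loads(episodes_json)
    chunks = list(csv.DictReader(io.StringIO(chunks_csv)))
    return {"episodes": len(episodes), "chunks": len(chunks)}

# Tiny inline samples standing in for the real files
sample_json = '[{"episode": 455, "title": "Aliens", "guest": "Adam Frank"}]'
sample_csv = "episode,chapter,start_ts,text\n455,Introduction,00:00:12,Hello\n"
print(summarize(sample_json, sample_csv))  # {'episodes': 1, 'chunks': 1}
```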
## What Gets Embedded
Each chunk is a structured text block that includes episode context and transcript content:
```
Episode: 455
Chapter: Introduction
BEFORE (context):
[~50 words from previous segment]
MAIN (target):
Adam Frank: If we don't ask how long they last, but instead ask
what's the probability that there have been any civilizations at all...
[~300 words - the core content we're searching]
AFTER (context):
[~50 words from next segment]
```
**Key design choices:**
- Episode number and chapter title are embedded with the text, which helps retrieval when users ask about specific episodes or topics
- The MAIN section contains the actual transcript with speaker names inline
- BEFORE/AFTER provide context so the LLM understands the quote in its original setting
- Overlapping windows ensure content isn't awkwardly split at chunk boundaries
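The block above is straightforward to assemble with a small formatter. A sketch (the function name is illustrative, not the actual pipeline code; the ~50/~300-word window sizes come from the format description above):

```python
def format_chunk(episode: int, chapter: str,
                 before: str, main: str, after: str) -> str:
    """Assemble the text that gets embedded for one chunk.

    `before`/`after` are ~50-word context windows from the neighboring
    segments; `main` is the ~300-word target segment with speaker
    names left inline.
    """
    return (
        f"Episode: {episode}\n"
        f"Chapter: {chapter}\n"
        f"BEFORE (context):\n{before}\n"
        f"MAIN (target):\n{main}\n"
        f"AFTER (context):\n{after}"
    )

chunk = format_chunk(
    455, "Introduction",
    "...previous segment...",
    "Adam Frank: If we don't ask how long they last...",
    "...next segment...",
)
```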
## Vector Store

**Embeddings:** OpenAI `text-embedding-3-small` (1536 dimensions)

**Index:** FAISS flat index (`IndexFlatL2`)

- Simple brute-force similarity search
- Works well at our scale (~15K vectors)
- No approximation: returns exact nearest neighbors
- Trade-off: larger indices (100K+ vectors) would benefit from IVF or HNSW for speed

**Storage:** `../faiss_index.db/`

- `index.faiss`: vector data
- `index.pkl`: metadata (episode, timestamp, URL, chapter)
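The "no approximation" point is easy to see: a flat L2 index simply computes the distance from the query to every stored vector and takes the smallest. A NumPy sketch of the equivalent brute-force search, for illustration only (the real index is built with FAISS by `build_index.py`):

```python
import numpy as np

def flat_l2_search(vectors: np.ndarray, query: np.ndarray, k: int = 5):
    """Exact k-nearest-neighbor search by squared L2 distance:
    the same computation a FAISS IndexFlatL2 performs (no ANN)."""
    dists = ((vectors - query) ** 2).sum(axis=1)  # one distance per stored vector
    top = np.argsort(dists)[:k]                   # indices of the k closest
    return top, dists[top]

rng = np.random.default_rng(0)
vectors = rng.normal(size=(15_000, 1536)).astype("float32")  # ~our scale
query = vectors[42] + 0.01 * rng.normal(size=1536).astype("float32")
idx, _ = flat_l2_search(vectors, query)
print(idx[0])  # 42: the perturbed source vector is its own nearest neighbor
```

This is O(n) per query, which is why flat indexes stay practical at ~15K vectors but get replaced by IVF or HNSW at much larger scales.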
## Adding New Episodes
To add new episodes or rebuild the index:
1. Scrape new episode metadata and transcripts
2. Process transcripts into chunks using the same BEFORE/MAIN/AFTER format
3. Append to `episodes_embedding_chunks.csv`
4. Rebuild the FAISS index:

   ```bash
   uv run python build_index.py
   ```

5. Check index status:

   ```bash
   uv run python build_index.py --check
   ```
The build script rebuilds the index from scratch each time; there is no incremental update path. For a production system with frequent updates, consider a vector database like Pinecone or Weaviate that supports upserts.
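Step 3 above is a plain CSV append. A sketch, assuming chunk rows carry episode/chapter/timestamp/text columns (the real header lives in `episodes_embedding_chunks.csv`, and episode 492 is a hypothetical next episode):

```python
import csv
import tempfile
from pathlib import Path

FIELDS = ["episode", "chapter", "start_ts", "text"]  # assumed column names

def append_chunks(csv_path: Path, rows: list[dict]) -> None:
    """Append chunk rows; write the header only if the file is new."""
    is_new = not csv_path.exists()
    with csv_path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)

# Demo in a temp dir so nothing in the repo is touched
path = Path(tempfile.mkdtemp()) / "chunks_demo.csv"
append_chunks(path, [{"episode": 492, "chapter": "Intro",
                      "start_ts": "00:00:05",
                      "text": "Host: Welcome back to the show..."}])
```

After appending, the full rebuild in step 4 re-reads the CSV and regenerates the index.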
**Current data (as of Feb 2026):** 108 episodes (#276–#491), newest: Peter Steinberger (OpenClaw)
## Privacy
No user data is stored here. All content is from publicly available podcast episodes.