# Data Pipeline

Source data and documentation for the vector index.

## Files

| File | Description |
|------|-------------|
| `episodes_website.json` | Episode metadata (108 episodes): titles, guests, YouTube URLs |
| `episodes_embedding_chunks.csv` | 14,865 transcript chunks with timestamps |

## What Gets Embedded

Each chunk is a structured text block that includes episode context and transcript content:

```
Episode: 455
Chapter: Introduction

BEFORE (context): [~50 words from previous segment]

MAIN (target): Adam Frank: If we don't ask how long they last, but instead ask what's the probability that there have been any civilizations at all... [~300 words - the core content we're searching]

AFTER (context): [~50 words from next segment]
```

**Key design choices:**

- Episode number and chapter title are embedded with the text — this helps retrieval when users ask about specific episodes or topics
- The MAIN section contains the actual transcript with speaker names inline
- BEFORE/AFTER provide context so the LLM understands the quote in its original setting
- Overlapping windows ensure content isn't awkwardly split at chunk boundaries

## Vector Store

**Embeddings:** OpenAI `text-embedding-3-small` (1536 dimensions)

**Index:** FAISS flat index (`IndexFlatL2`)

- Simple brute-force similarity search
- Works well for our scale (~15K vectors)
- No approximation — returns exact nearest neighbors
- Trade-off: larger indices (100K+ vectors) would benefit from IVF or HNSW for speed

**Storage:** `../faiss_index.db/`

- `index.faiss` — vector data
- `index.pkl` — metadata (episode, timestamp, URL, chapter)

## Adding New Episodes

To add new episodes or rebuild the index:

1. Scrape new episode metadata and transcripts
2. Process transcripts into chunks using the same BEFORE/MAIN/AFTER format
3. Append to `episodes_embedding_chunks.csv`
4. Rebuild the FAISS index:

   ```bash
   uv run python build_index.py
   ```
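Step 2 above (producing the BEFORE/MAIN/AFTER block) can be sketched with a small helper. This is a minimal illustration of the chunk format, not the repo's actual code; the function and parameter names are hypothetical.

```python
def build_chunk_text(episode: int, chapter: str,
                     before: str, main: str, after: str) -> str:
    """Assemble the structured text block that gets embedded.

    Hypothetical helper mirroring the BEFORE/MAIN/AFTER format shown
    under "What Gets Embedded"; field names are assumptions.
    """
    return (
        f"Episode: {episode}\n"
        f"Chapter: {chapter}\n\n"
        f"BEFORE (context): {before}\n\n"
        f"MAIN (target): {main}\n\n"
        f"AFTER (context): {after}"
    )

chunk = build_chunk_text(
    episode=455,
    chapter="Introduction",
    before="...previous ~50 words...",
    main="Adam Frank: If we don't ask how long they last...",
    after="...next ~50 words...",
)
print(chunk.splitlines()[0])  # → Episode: 455
```

Embedding the episode number and chapter alongside the transcript text (rather than storing them only as metadata) is what lets episode- or topic-specific queries match on those fields.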
To check the index status:

```bash
uv run python build_index.py --check
```

The flat index must be fully rebuilt — it doesn't support incremental additions. For a production system with frequent updates, consider a vector database such as Pinecone or Weaviate that supports upserts.

**Current data (as of Feb 2026):** 108 episodes (#276–#491), newest: Peter Steinberger (OpenClaw)

## Privacy

No user data is stored here. All content is from publicly available podcast episodes.
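For reference, the exact nearest-neighbor behavior of the flat index described under Vector Store amounts to a brute-force L2 scan over every stored vector. This toy sketch uses plain Python and 3-D vectors standing in for the real 1536-dimensional embeddings; it illustrates the semantics of `IndexFlatL2`, not FAISS itself.

```python
def l2_squared(a, b):
    """Squared L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flat_search(index_vectors, query, k=2):
    """Return the ids of the k exact nearest vectors by scanning them all,
    the same brute-force strategy a flat index uses (no approximation)."""
    ranked = sorted(range(len(index_vectors)),
                    key=lambda i: l2_squared(index_vectors[i], query))
    return ranked[:k]

vectors = [
    [0.0, 0.0, 1.0],
    [0.9, 0.1, 0.0],
    [1.0, 0.0, 0.0],
]
print(flat_search(vectors, [1.0, 0.0, 0.1]))  # → [2, 1]
```

At ~15K vectors this full scan is cheap; approximate structures like IVF or HNSW only pay off at much larger scales, which is the trade-off noted above.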