| # Data Pipeline | |
| Source data and documentation for the vector index. | |
| ## Files | |
| | File | Description | | |
| |------|-------------| | |
| | `episodes_website.json` | Episode metadata (108 episodes): titles, guests, YouTube URLs | | |
| | `episodes_embedding_chunks.csv` | 14,865 transcript chunks with timestamps | | |
| ## What Gets Embedded | |
| Each chunk is a structured text block that includes episode context and transcript content: | |
| ``` | |
| Episode: 455 | |
| Chapter: Introduction | |
| BEFORE (context): | |
| [~50 words from previous segment] | |
| MAIN (target): | |
| Adam Frank: If we don't ask how long they last, but instead ask | |
| what's the probability that there have been any civilizations at all... | |
| [~300 words - the core content we're searching] | |
| AFTER (context): | |
| [~50 words from next segment] | |
| ``` | |
| **Key design choices:** | |
| - Episode number and chapter title are embedded with the text, which helps retrieval when users ask about specific episodes or topics | |
| - The MAIN section contains the actual transcript with speaker names inline | |
| - BEFORE/AFTER provide context so the LLM understands the quote in its original setting | |
| - Overlapping windows ensure content isn't awkwardly split at chunk boundaries | |
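The chunk layout above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual code: the helper names `make_windows` and `format_chunk` are hypothetical, and it assumes windows are measured in whitespace-separated words (about 300 for MAIN, about 50 for each context block), whereas the real pipeline may cut on segment boundaries.

```python
def make_windows(words, main_size=300, context_size=50):
    """Split a word list into MAIN windows with BEFORE/AFTER context.

    Hypothetical helper: sizes are word counts, taken from the
    approximate figures in the chunk example.
    """
    windows = []
    for start in range(0, len(words), main_size):
        end = min(start + main_size, len(words))
        windows.append({
            # Context overlaps the neighboring windows' MAIN text.
            "before": words[max(0, start - context_size):start],
            "main": words[start:end],
            "after": words[end:end + context_size],
        })
    return windows


def format_chunk(episode, chapter, window):
    """Render one window as the structured text block that gets embedded."""
    return "\n".join([
        f"Episode: {episode}",
        f"Chapter: {chapter}",
        "BEFORE (context):",
        " ".join(window["before"]),
        "MAIN (target):",
        " ".join(window["main"]),
        "AFTER (context):",
        " ".join(window["after"]),
    ])
```

Because each BEFORE and AFTER block repeats words from the adjacent MAIN blocks, a quote that straddles a boundary still appears with its surrounding text in at least one chunk.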
| ## Vector Store | |
| **Embeddings:** OpenAI `text-embedding-3-small` (1536 dimensions) | |
| **Index:** FAISS flat index (`IndexFlatL2`) | |
| - Simple brute-force similarity search | |
| - Works well for our scale (~15K vectors) | |
| - No approximation: returns exact nearest neighbors | |
| - Trade-off: Larger indices (100K+) would benefit from IVF or HNSW for speed | |
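To make "brute-force" concrete, the sketch below shows what a flat L2 index computes, written in plain Python instead of FAISS so it runs without extra dependencies. The function names are hypothetical. It compares the query against every stored vector and keeps the `k` smallest squared L2 distances, which is why the result is exact and why the cost grows linearly with the number of vectors.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_search(vectors, query, k=3):
    """Exhaustive nearest-neighbor search (illustrative, not FAISS itself).

    Returns up to k (distance, index) pairs, closest first.
    """
    scored = ((squared_l2(vec, query), i) for i, vec in enumerate(vectors))
    return heapq.nsmallest(k, scored)


# Tiny 2-D example; real embeddings here have 1536 dimensions.
vectors = [[0, 0], [1, 0], [3, 4], [0.5, 0.5]]
nearest = flat_search(vectors, [0, 0], k=2)
```

Approximate indices such as IVF or HNSW avoid scanning every vector, trading a small loss in recall for speed on larger collections.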
| **Storage:** `../faiss_index.db/` | |
| - `index.faiss`: vector data | |
| - `index.pkl`: metadata (episode, timestamp, URL, chapter) | |
| ## Adding New Episodes | |
| To add new episodes or rebuild the index: | |
| 1. Scrape new episode metadata and transcripts | |
| 2. Process transcripts into chunks using the same BEFORE/MAIN/AFTER format | |
| 3. Append to `episodes_embedding_chunks.csv` | |
| 4. Rebuild the FAISS index: | |
| ```bash | |
| uv run python build_index.py | |
| ``` | |
| 5. Check index status: | |
| ```bash | |
| uv run python build_index.py --check | |
| ``` | |
| The index is rebuilt from scratch on every run; this workflow does not add vectors incrementally. For a production system with frequent updates, consider a vector database such as Pinecone or Weaviate that supports upserts. | |
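Step 3 (appending to the chunk CSV) might look like the sketch below. The column names in `FIELDS` and the `append_chunks` helper are assumptions made for illustration; check the header of `episodes_embedding_chunks.csv` and match its actual columns before using anything like this.

```python
import csv
from pathlib import Path

# Hypothetical column names; the real CSV header may differ.
FIELDS = ["episode", "chapter", "timestamp", "text"]


def append_chunks(csv_path, rows):
    """Append chunk rows (a list of dicts) to a CSV, writing the header if the file is new."""
    path = Path(csv_path)
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

After appending, rebuild the index as in step 4 so the new rows are embedded.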
| **Current data (as of Feb 2026):** 108 episodes (#276–#491), newest: Peter Steinberger (OpenClaw) | |
| ## Privacy | |
| No user data is stored here. All content is from publicly available podcast episodes. | |