---
title: Substack Semantic Search
emoji: 🔎
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: "6.0.0"
app_file: app.py
pinned: false
---

# 🔎 Semantic Search over Substack Posts

This Space hosts a semantic search engine built over a collection of Substack HTML posts.
It uses **SentenceTransformers**, **FAISS**, and **Gradio** to provide fast, offline semantic similarity search.

---

## 🚀 How It Works

### 1. **Chunk + Embed**

HTML posts from the `posts/` directory are:

- parsed with BeautifulSoup
- split into manageable text chunks
- embedded using `all-MiniLM-L6-v2`
- stored in a FAISS vector index

### 2. **Vector Search**

At runtime, the app:

- loads `faiss_index.bin` and `faiss_meta.pkl`
- embeds the user query
- retrieves the most semantically relevant chunks

### 3. **Gradio App**

The search UI is powered by Gradio and runs fully offline inside this Space.

---

## Local Usage

To rebuild the FAISS index and run the app locally:

```
pip install -r requirements.txt
python src/build_index.py
python app.py
```

Ensure your `.html` files live in:

```
posts/
```

Make sure these files are at the repository root:

```
faiss_index.bin
faiss_meta.pkl
app.py
requirements.txt
```

After the index is built, you can relaunch the app at any time with:

```
python app.py
```
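
## Example Sketch

The chunking and query steps described under "How It Works" can be sketched roughly as follows. This is a minimal illustration, not the code in `src/build_index.py` or `app.py`: the function names `chunk_text` and `search`, the word-window sizes, and the assumption that `faiss_meta.pkl` holds a list aligned with the index row ids are all hypothetical.

```python
import pickle


def chunk_text(text, max_words=200, overlap=40):
    """Split text into overlapping word-window chunks (sizes are illustrative)."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once this window has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


def search(query, k=5, index_path="faiss_index.bin", meta_path="faiss_meta.pkl"):
    """Embed the query and return up to k (metadata entry, distance) pairs."""
    # Imported lazily so chunk_text stays usable without the heavy dependencies.
    import faiss
    from sentence_transformers import SentenceTransformer

    index = faiss.read_index(index_path)
    with open(meta_path, "rb") as f:
        meta = pickle.load(f)  # assumption: a list aligned with index row ids

    model = SentenceTransformer("all-MiniLM-L6-v2")
    query_vec = model.encode([query], convert_to_numpy=True).astype("float32")
    distances, ids = index.search(query_vec, k)
    # FAISS pads results with -1 when fewer than k vectors are available.
    return [(meta[i], float(d)) for d, i in zip(distances[0], ids[0]) if i != -1]
```

Overlapping windows are one common way to keep sentences that straddle a chunk boundary retrievable; the real index may use a different splitting strategy. Calling `search("your query")` requires the two index files to exist at the repository root.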