title: Substack Semantic Search
emoji: 🔎
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 6.0.0
app_file: app.py
pinned: false

🔎 Semantic Search over Substack Posts

This Space hosts a semantic search engine built over a collection of Substack HTML posts.
It uses SentenceTransformers, FAISS, and Gradio to provide fast, offline semantic similarity search.


🚀 How It Works

1. Chunk + Embed

HTML posts from the posts/ directory are:

  • parsed with BeautifulSoup
  • split into manageable text chunks
  • embedded using all-MiniLM-L6-v2
  • stored in a FAISS vector index
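
The steps above can be sketched roughly as follows. This is an illustration, not the contents of src/build_index.py: the helper names (`html_to_text`, `chunk_text`, `build_index`, `save_index`), the 200-word chunks with a 40-word overlap, the metadata layout, and the choice of a flat inner-product index over normalized embeddings are all assumptions.

```python
import pickle
from pathlib import Path


def html_to_text(html: str) -> str:
    """Strip tags and collapse whitespace (hypothetical helper)."""
    from bs4 import BeautifulSoup  # lazy import: chunk_text works without it

    soup = BeautifulSoup(html, "html.parser")
    return " ".join(soup.get_text(separator=" ").split())


def chunk_text(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of `size` words that overlap by `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def build_index(posts_dir: str = "posts"):
    """Embed every chunk and add it to a FAISS index; returns (index, meta)."""
    import faiss
    from sentence_transformers import SentenceTransformer

    meta = []  # assumed layout: one dict per vector, in insertion order
    for path in sorted(Path(posts_dir).glob("*.html")):
        text = html_to_text(path.read_text(encoding="utf-8", errors="ignore"))
        for chunk in chunk_text(text):
            meta.append({"source": path.name, "text": chunk})
    if not meta:
        raise ValueError(f"no .html posts with text found in {posts_dir}/")

    model = SentenceTransformer("all-MiniLM-L6-v2")
    vectors = model.encode(
        [m["text"] for m in meta],
        convert_to_numpy=True,
        normalize_embeddings=True,  # unit vectors, so inner product = cosine
    ).astype("float32")

    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(vectors)
    return index, meta


def save_index(index, meta, index_path="faiss_index.bin", meta_path="faiss_meta.pkl"):
    """Write the index and its metadata to the two files the app loads."""
    import faiss

    faiss.write_index(index, index_path)
    with open(meta_path, "wb") as fh:
        pickle.dump(meta, fh)
```

Overlapping windows are one common way to avoid cutting a sentence's context at a chunk boundary; the actual chunking strategy used here may differ.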

2. Vector Search

At runtime, the app:

  • loads faiss_index.bin and faiss_meta.pkl
  • embeds the user query
  • retrieves the most semantically relevant chunks
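
A minimal sketch of that runtime path is shown below. It assumes faiss_meta.pkl holds a list of dicts aligned with the vector order in the index, and that the index was built from normalized embeddings; neither is confirmed by this README. The function names (`load_store`, `collect_hits`, `search`) are hypothetical. Depending on the index type, the returned scores may be similarities or distances.

```python
import pickle


def load_store(index_path: str = "faiss_index.bin", meta_path: str = "faiss_meta.pkl"):
    """Load the FAISS index and the pickled metadata from disk."""
    import faiss  # lazy import so the pure helpers below work without it

    index = faiss.read_index(index_path)
    with open(meta_path, "rb") as fh:
        meta = pickle.load(fh)  # assumed: list of dicts, one per vector
    return index, meta


def collect_hits(scores, ids, meta):
    """Pair scores and ids with metadata, skipping the -1 ids FAISS
    returns when fewer than k results are available."""
    hits = []
    for score, idx in zip(scores, ids):
        if idx < 0:
            continue
        hits.append({"score": float(score), **meta[idx]})
    return hits


def search(query: str, index, meta, model, k: int = 5):
    """Embed the query and return the top-k chunks with their scores."""
    vec = model.encode(
        [query],
        convert_to_numpy=True,
        normalize_embeddings=True,  # assumption: must match how the index was built
    ).astype("float32")
    scores, ids = index.search(vec, k)  # both have shape (1, k)
    return collect_hits(scores[0], ids[0], meta)
```

Here `model` would be a `SentenceTransformer("all-MiniLM-L6-v2")` instance, loaded once at startup rather than on every query.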

3. Gradio App

The search UI is powered by Gradio and runs fully offline inside this Space.
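
For illustration, a search function could be wired into a Gradio UI along these lines. This is a sketch, not the actual app.py: `format_results`, `build_demo`, and the `search_fn` callable (assumed to return a list of dicts with `score` and `text` keys) are hypothetical.

```python
def format_results(hits) -> str:
    """Render search hits as a numbered Markdown list."""
    if not hits:
        return "No results."
    lines = []
    for rank, hit in enumerate(hits, start=1):
        lines.append(f"{rank}. ({hit['score']:.3f}) {hit['text']}")
    return "\n".join(lines)


def build_demo(search_fn):
    """Wrap a query -> hits function in a simple text-in, Markdown-out UI."""
    import gradio as gr  # lazy import so format_results works without Gradio

    return gr.Interface(
        fn=lambda query: format_results(search_fn(query)),
        inputs=gr.Textbox(label="Query"),
        outputs=gr.Markdown(),
        title="Substack Semantic Search",
    )


# Usage, assuming my_search is defined elsewhere:
#   demo = build_demo(my_search)
#   demo.launch()
```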


Local Usage

To rebuild the FAISS index and run the app locally:

pip install -r requirements.txt
python src/build_index.py
python app.py

Ensure your .html files live in:

posts/

Make sure these files are at the repository root:

faiss_index.bin
faiss_meta.pkl
app.py
requirements.txt
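
As a quick sanity check before running, a short script along these lines can confirm the layout. It is only a convenience sketch; `missing_files` is a hypothetical helper and not part of this repository.

```python
from pathlib import Path

# Files this README says must sit at the repository root.
REQUIRED = ("faiss_index.bin", "faiss_meta.pkl", "app.py", "requirements.txt")


def missing_files(root: str = ".") -> list[str]:
    """Return the required files that are absent from `root`."""
    base = Path(root)
    return [name for name in REQUIRED if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing at root:", ", ".join(missing))
    else:
        print("All required files are present.")
```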

Once your local run works:

python app.py