---
title: Reddit SemanticSearch Prototype
emoji: 🐨
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: 5.41.0
app_file: app.py
pinned: false
short_description: 'r/technology, r/gaming, r/programming etc search comments '
---

# Reddit Semantic Search (Prototype)

A lightweight semantic search engine built on Reddit comments using:
- **Word2Vec embeddings** (trained from scratch on selected subreddits)
- **FAISS** for fast vector indexing and retrieval
- **Gradio** for a user-friendly, Reddit-themed interface

> ⚠️ This is an independent prototype. Not affiliated with Reddit Inc.

---

## Dataset

- Source: [`HuggingFaceGECLM/REDDIT_comments`](https://huggingface.co/datasets/HuggingFaceGECLM/REDDIT_comments)
- Subreddits used:
  - `askscience`, `gaming`, `technology`, `todayilearned`, `programming`
- Data was streamed using Hugging Face's `datasets` library and chunked using PySpark.
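
The streaming step might look like the sketch below. It assumes each subreddit is exposed as its own split and that the comment text lives in a `body` field; both are assumptions to verify against the dataset card. `stream_comments` and `take` are hypothetical helper names. For example, `take(stream_comments("programming"), 1000)` would pull only the first 1,000 comments.

```python
from itertools import islice
from typing import Iterable, Iterator

SUBREDDITS = ["askscience", "gaming", "technology", "todayilearned", "programming"]


def stream_comments(subreddit: str, text_field: str = "body") -> Iterator[str]:
    """Lazily yield comment texts for one subreddit without downloading the full split."""
    # Imported lazily so the helpers below stay usable without `datasets` installed.
    from datasets import load_dataset

    # Assumption: the split name equals the subreddit name.
    rows = load_dataset(
        "HuggingFaceGECLM/REDDIT_comments", split=subreddit, streaming=True
    )
    for row in rows:
        text = row.get(text_field)  # assumed field name for the comment text
        if text:
            yield text


def take(items: Iterable[str], n: int) -> list[str]:
    """Materialize at most the first n items of a (possibly very long) stream."""
    return list(islice(items, n))
```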

---

## Project Pipeline

1. **Data Loading & Chunking**
   - Load each subreddit split individually using streaming
   - Group every 5 comments into a single text chunk using PySpark
   - Clean and tokenize the text for training

2. **Training Word2Vec**
   - Custom embeddings are trained with `gensim`'s Word2Vec on the cleaned comment chunks

3. **Vector Indexing (FAISS)**
   - Each chunk is embedded by averaging the Word2Vec vectors of its words
   - The resulting dense vectors are indexed with `faiss.IndexFlatL2`

4. **Semantic Search App (Gradio)**
   - Enter a query and select a subreddit filter
   - The app retrieves the top 5 most semantically similar comment chunks
   - Reranking is not implemented yet and can be added later
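
Step 1 can be illustrated without a Spark cluster. The sketch below uses plain Python in place of PySpark to show the grouping and cleaning logic only; the cleaning rules (lowercasing, dropping URLs, a simple token pattern) are assumptions, and `chunk_comments` and `tokenize` are hypothetical names.

```python
import re
from typing import Iterable, Iterator

URL_RE = re.compile(r"https?://\S+")
TOKEN_RE = re.compile(r"[a-z0-9']+")


def chunk_comments(comments: Iterable[str], size: int = 5) -> Iterator[str]:
    """Join every `size` consecutive comments into one text chunk."""
    batch: list[str] = []
    for comment in comments:
        batch.append(comment.strip())
        if len(batch) == size:
            yield " ".join(batch)
            batch = []
    if batch:  # keep the final, shorter chunk instead of dropping it
        yield " ".join(batch)


def tokenize(text: str) -> list[str]:
    """Lowercase, drop URLs, and split into simple word tokens."""
    return TOKEN_RE.findall(URL_RE.sub(" ", text.lower()))
```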
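
Steps 2 and 3 (training and chunk embedding) could be sketched as follows. This assumes the gensim 4.x API; the hyperparameters are illustrative and not the project's actual settings, and `train_word2vec` and `embed_chunk` are hypothetical names. `token_lists` should be a list of token lists (not a one-shot generator), and `vectors` can be `model.wv` or any mapping from word to vector.

```python
import numpy as np


def train_word2vec(token_lists, vector_size=100):
    """Train a Word2Vec model on tokenized chunks (gensim 4.x API)."""
    from gensim.models import Word2Vec  # lazy import: only needed for training

    # Illustrative hyperparameters (assumptions, not the project's settings).
    return Word2Vec(
        sentences=token_lists,
        vector_size=vector_size,
        window=5,
        min_count=2,
        workers=4,
        epochs=5,
    )


def embed_chunk(tokens, vectors, dim):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # No known words at all: fall back to a zero vector.
        return np.zeros(dim, dtype=np.float32)
    return np.mean(known, axis=0).astype(np.float32)
```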
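
Steps 3 and 4 (indexing and search) might be wired together as in the sketch below. `faiss.IndexFlatL2` performs exact brute-force search, so no training step is needed. The subreddit filter is applied here after retrieval by over-fetching neighbours; that strategy, the `"all"` option, and the names `build_index`, `filter_hits`, `search`, and `launch_app` are assumptions for illustration, not the app's confirmed design.

```python
import numpy as np


def build_index(embeddings: np.ndarray):
    """Build an exact L2 index over a 2-D array of chunk embeddings."""
    import faiss  # lazy import so the pure-Python helpers work without FAISS

    index = faiss.IndexFlatL2(embeddings.shape[1])
    index.add(np.ascontiguousarray(embeddings, dtype=np.float32))
    return index


def filter_hits(ids, chunks, subreddits, wanted, k=5):
    """Keep the first k hits from the wanted subreddit ("all" disables the filter)."""
    hits = []
    for i in ids:
        if i < 0:  # FAISS pads with -1 when fewer results exist
            continue
        if wanted == "all" or subreddits[i] == wanted:
            hits.append(chunks[i])
        if len(hits) == k:
            break
    return hits


def search(query_vec, index, chunks, subreddits, wanted="all", k=5, oversample=10):
    """Over-fetch nearest neighbours, then filter by subreddit."""
    query = np.asarray(query_vec, dtype=np.float32).reshape(1, -1)
    n = min(k * oversample, len(chunks))
    _, ids = index.search(query, n)
    return filter_hits(ids[0], chunks, subreddits, wanted, k)


def launch_app(search_fn, choices):
    """Expose a (query, subreddit) -> markdown string function in a Gradio UI."""
    import gradio as gr

    demo = gr.Interface(
        fn=search_fn,
        inputs=[
            gr.Textbox(label="Query"),
            gr.Dropdown(choices=choices, value="all", label="Subreddit"),
        ],
        outputs=gr.Markdown(),
    )
    demo.launch()
```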
---

## Run the App

```bash
pip install -r requirements.txt
python app.py  # or run the notebook
```

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference