	---
	title: Reddit SemanticSearch Prototype
	emoji: 🐨
	colorFrom: purple
	colorTo: indigo
	sdk: gradio
	sdk_version: 5.41.0
	app_file: app.py
	pinned: false
	short_description: 'Search r/technology, r/gaming, r/programming etc. comments'
	---

	# Reddit Semantic Search (Prototype)

	A lightweight semantic search engine built on Reddit comments using:
	- Word2Vec embeddings (trained from scratch on selected subreddits)
	- FAISS for fast vector indexing and retrieval
	- Gradio for a user-friendly, Reddit-themed interface

	> ⚠️ This is an independent prototype. Not affiliated with Reddit Inc.

	---

	## Dataset

	- Source: [`HuggingFaceGECLM/REDDIT_comments`](https://huggingface.co/datasets/HuggingFaceGECLM/REDDIT_comments)
	- Subreddits used: `askscience`, `gaming`, `technology`, `todayilearned`, `programming`
	- Data was streamed using Hugging Face's `datasets` library and chunked using PySpark.

	---

	## Project Pipeline

	1. Data Loading & Chunking
	- Load subreddit splits individually using streaming
	- Group every 5 comments into a single text chunk using PySpark
	- Clean and tokenize text for training
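The chunking and tokenizing step can be sketched in plain Python. This is only an illustration: the README says the real pipeline does the grouping in PySpark, and the cleaning rules below (lowercasing, dropping URLs, keeping alphanumeric tokens) are assumptions, as are the helper names `chunk_comments` and `clean_and_tokenize`.

```python
import re
from itertools import islice
from typing import Iterable, Iterator, List

CHUNK_SIZE = 5  # the README groups every 5 comments into one chunk


def chunk_comments(comments: Iterable[str], size: int = CHUNK_SIZE) -> Iterator[str]:
    """Yield text chunks made by joining `size` consecutive comments."""
    it = iter(comments)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield " ".join(batch)


def clean_and_tokenize(text: str) -> List[str]:
    """Lowercase, drop URLs, and keep simple word tokens (assumed rules)."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    return re.findall(r"[a-z0-9']+", text)


# Toy comments standing in for a streamed subreddit split.
comments = [f"Comment number {i} about Python!" for i in range(12)]
chunks = list(chunk_comments(comments))
tokenized = [clean_and_tokenize(chunk) for chunk in chunks]
```

Because `chunk_comments` consumes an iterator lazily, it would also work on a streamed dataset without loading every comment into memory; the last chunk may hold fewer than 5 comments.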

	2. Training Word2Vec
	- Custom embeddings trained using `gensim`'s Word2Vec on cleaned comment chunks

	3. Vector Indexing (FAISS)
	- Each chunk is embedded by averaging the Word2Vec vectors of its words
	- The resulting dense vectors are indexed with `faiss.IndexFlatL2`

	4. Semantic Search App (Gradio)
	- Enter your query and select a subreddit filter
	- Retrieves the top 5 semantically similar comment chunks
	- Reranking logic is not included yet and can be added later
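The search behaviour described above can be illustrated with a small filter-then-rank function. This sketch uses brute-force NumPy distances instead of FAISS so it stays self-contained, and it assumes the subreddit filter is applied before ranking; the README does not say whether the app filters before or after retrieval. The `search` function, the `"all"` option, and the toy data are hypothetical.

```python
import numpy as np

TOP_K = 5  # the app returns the top 5 chunks


def search(query_vec, chunk_vecs, chunk_subreddits, chunk_texts,
           subreddit="all", k=TOP_K):
    """Return up to k (text, squared L2 distance) pairs, nearest first."""
    if subreddit == "all":
        mask = np.ones(len(chunk_texts), dtype=bool)
    else:
        mask = np.asarray(chunk_subreddits) == subreddit
    candidates = np.flatnonzero(mask)
    if candidates.size == 0:
        return []
    # Squared L2 distance from the query to each candidate chunk.
    dists = np.sum((chunk_vecs[candidates] - query_vec) ** 2, axis=1)
    order = np.argsort(dists)[:k]
    return [(chunk_texts[candidates[i]], float(dists[i])) for i in order]


# Toy index: 2-dimensional chunk vectors with their subreddit labels.
chunk_vecs = np.array(
    [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.2, 0.8]], dtype=np.float32
)
chunk_subreddits = ["programming", "gaming", "programming", "gaming"]
chunk_texts = ["python tips", "boss fight", "rust tips", "speedrun"]

query_vec = np.array([1.0, 0.0], dtype=np.float32)
everything = search(query_vec, chunk_vecs, chunk_subreddits, chunk_texts, k=2)
gaming_only = search(query_vec, chunk_vecs, chunk_subreddits, chunk_texts,
                     subreddit="gaming")
```

A function of this shape, taking a query and a subreddit choice and returning ranked text, is the kind of callable a Gradio interface would wrap. Filtering first guarantees that every returned chunk comes from the chosen subreddit, whereas filtering after a top-5 lookup could return fewer than 5 results.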

	---

	## Run the App

	```bash
	pip install -r requirements.txt
	python app.py  # or run the notebook
	```

	Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference