	---
	license: mit
	title: 🤗 GitHub Issue Hybrid Search & Auto-Label Assistant
	sdk: gradio
	colorFrom: blue
	colorTo: green
	pinned: true
	thumbnail: >-
	  https://cdn-uploads.huggingface.co/production/uploads/68281bec37b367f53ec121da/jbr9Hs4azEMPGbTZ2dYIz.png
	short_description: Hybrid semantic search and auto-labeling for GitHub issues
	sdk_version: 6.2.0
	---

	# 🤗 GitHub Issue Hybrid Search & Auto-Label Assistant

	Precision-first hybrid ranking for real GitHub issues using Semantic Search + Multilabel Classification.

	This project demonstrates a production-oriented hybrid retrieval system that helps with GitHub issue triage by combining:
	- Dense semantic search (MPNet embeddings + FAISS)
	- Multilabel text classification (DistilBERT)
	- A hybrid ranking strategy that fuses semantic similarity and label consistency

	A live, interactive demo is available via Hugging Face Spaces.

	---

	## 🚀 Live Demo

	🔗 Hugging Face Space: GitHub Issue Hybrid Search & Auto-Label Assistant

	Users can describe a GitHub issue in natural language and instantly:
	- See predicted issue labels (e.g. `Bug`, `Needs Triage`)
	- Retrieve the most relevant existing GitHub issues
	- Inspect semantic similarity, label overlap, and final hybrid scores
	- Open the original GitHub issues directly

	---

	## 🔍 Problem Motivation

	Large open-source repositories receive thousands of issues, making it hard to:
	- Find similar historical issues
	- Detect duplicates
	- Assign correct labels early
	- Route issues to the right maintainers

	Keyword search alone is often insufficient, because related issues are frequently described in different words.
	This project addresses that gap with semantic, label-aware retrieval.

	---

	## 🧠 System Overview

	The pipeline consists of four main stages:

	### 1. Semantic Encoding
	User queries are encoded using dense sentence embeddings:
	- Model: `sentence-transformers/all-mpnet-base-v2`

	Issue texts in the dataset are pre-embedded and stored for fast retrieval.
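A minimal sketch of the query-encoding step, assuming the `sentence-transformers` package is installed. The helper names `load_encoder` and `encode_query` are hypothetical, and normalizing embeddings to unit length is an assumption of this sketch (it makes the inner product equal to cosine similarity), not something this README specifies.

```python
def load_encoder(model_name="sentence-transformers/all-mpnet-base-v2"):
    """Load the embedding model named above (weights are downloaded on first use)."""
    # Imported lazily so the rest of the module works without the dependency.
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(model_name)


def encode_query(text, encoder):
    """Return one embedding vector for a single query string.

    `encoder` is any object with a SentenceTransformer-style `encode` method.
    normalize_embeddings=True is an assumption made for this sketch.
    """
    vectors = encoder.encode([text], normalize_embeddings=True)
    return vectors[0]
```

With the real model, `encode_query("RandomForest crashes on sparse input", load_encoder())` would return a 768-dimensional vector, which is the output size of `all-mpnet-base-v2`.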

	---

	### 2. Semantic Retrieval
	- Index: FAISS, built at runtime from the pre-computed embeddings
	- Retrieves the nearest issues based on vector similarity
	- Optimized for dense semantic matching rather than keywords
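To illustrate what nearest-neighbor retrieval computes, here is a pure-Python stand-in for the index lookup. It is not the project's code: the function names are hypothetical, and it assumes unit-length vectors scored by inner product. An exhaustive FAISS index such as `faiss.IndexFlatIP` performs the same kind of search far more efficiently; this README does not state which FAISS index type is actually used.

```python
import heapq


def inner_product(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def search(query_vec, issue_vecs, k=5):
    """Return up to k (score, row_index) pairs, best match first.

    Exhaustive search: every stored vector is compared with the query.
    """
    scored = (
        (inner_product(query_vec, vec), idx)
        for idx, vec in enumerate(issue_vecs)
    )
    return heapq.nlargest(k, scored)
```

The returned row indices would then be used to look up issue titles, labels, and URLs in the dataset.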

	---

	### 3. Multilabel Classification
	A fine-tuned DistilBERT model predicts issue labels:
	- Example labels: `Bug`, `Needs Triage`, `module:ensemble`, `module:tree`
	- Multiple labels can be assigned per issue
	- Outputs a confidence score for each predicted label
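A common way to turn a multilabel classifier's raw outputs into label predictions is an independent sigmoid per label followed by a threshold. The sketch below assumes that approach; the 0.5 threshold, the function names, and the four-label list (taken from the examples above) are illustrative, not the project's actual configuration.

```python
import math

# Example labels from this README; the real label set is likely larger.
LABELS = ["Bug", "Needs Triage", "module:ensemble", "module:tree"]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def predict_labels(logits, labels=LABELS, threshold=0.5):
    """Map one logit per label to {label: probability}, keeping labels at or above threshold.

    The threshold value is an assumption of this sketch.
    """
    probs = {label: sigmoid(z) for label, z in zip(labels, logits)}
    return {label: p for label, p in probs.items() if p >= threshold}
```

Because each label is scored independently, an issue can receive several labels or none at all.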

	---

	### 4. Hybrid Ranking (Key Contribution)
	Semantic similarity alone is not always enough.
	This system uses a hybrid scoring function:

	`final_score = α · semantic_similarity + β · label_overlap`

	- Semantic similarity: how close the issue texts are in embedding space
	- Label overlap: how well predicted labels match existing issue labels
	- α / β: tunable weights (precision-first by design)
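The scoring function can be sketched as follows. This README does not say how label overlap is measured or which weights are used, so the Jaccard definition and the defaults `alpha=0.7`, `beta=0.3` below are assumptions chosen for illustration.

```python
def label_overlap(predicted, existing):
    """Jaccard overlap between predicted and existing labels (an assumed definition)."""
    predicted, existing = set(predicted), set(existing)
    if not predicted and not existing:
        return 0.0  # no labels on either side: treat as no overlap
    return len(predicted & existing) / len(predicted | existing)


def hybrid_score(semantic_similarity, overlap, alpha=0.7, beta=0.3):
    """final_score = alpha * semantic_similarity + beta * label_overlap.

    The default weights are hypothetical; tune them for the target repository.
    """
    return alpha * semantic_similarity + beta * overlap
```

For example, an issue with semantic similarity 0.8 whose labels half-match the prediction would score 0.7 · 0.8 + 0.3 · 0.5 = 0.71 under these assumed weights.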

	Issues are also deduplicated by issue number to avoid repeated results.
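One way to implement this deduplication is to keep only the best-scoring entry per issue number. The sketch assumes each result is a dict with hypothetical `number` and `final_score` keys; the project's actual record layout may differ.

```python
def dedupe_by_issue(results):
    """Keep the highest-scoring entry per issue number, best first.

    Each result is assumed to be a dict with "number" and "final_score" keys.
    """
    best = {}
    for item in results:
        number = item["number"]
        if number not in best or item["final_score"] > best[number]["final_score"]:
            best[number] = item
    return sorted(best.values(), key=lambda r: r["final_score"], reverse=True)
```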

	---

	## 🎯 Design Principles

	### Precision-First Retrieval
	- The system may intentionally return fewer than `k` results
	- It avoids surfacing weakly related issues
	- Returning 1–4 highly relevant issues is considered a success
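A precision-first cutoff of this kind can be sketched as a score threshold applied before truncating to `k`. The `min_score` value and the function name are hypothetical; this README does not state the actual cutoff.

```python
def select_results(ranked, k=5, min_score=0.5):
    """Return at most k results whose final_score reaches min_score.

    Fewer than k results (or none) are returned when few candidates qualify.
    min_score=0.5 is an assumed value for illustration.
    """
    confident = [r for r in ranked if r["final_score"] >= min_score]
    confident.sort(key=lambda r: r["final_score"], reverse=True)
    return confident[:k]
```

Raising `min_score` trades recall for precision: the list gets shorter, but every remaining result clears a higher bar.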

	### Runtime FAISS Indexing
	- FAISS indices are created at runtime
	- Keeps datasets lightweight and portable on Hugging Face Hub

	### Transparency
	- Scores are explicitly shown:
	  - Semantic similarity
	  - Label overlap
	  - Final hybrid score
	- GitHub URLs are fully visible and clickable
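As an illustration of this kind of transparent output, the sketch below renders one result as a Markdown line containing all three scores and a clickable link. The field names and the layout are assumptions; the demo's actual presentation may differ.

```python
def format_result(result):
    """Render one ranked issue as a Markdown line with all three scores.

    The dict keys used here are hypothetical field names.
    """
    return (
        f"[#{result['number']} {result['title']}]({result['url']}) | "
        f"semantic: {result['semantic_similarity']:.2f} | "
        f"label overlap: {result['label_overlap']:.2f} | "
        f"hybrid: {result['final_score']:.2f}"
    )
```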

	---

	## 📦 Models & Data

	### Embedding Model
	- `sentence-transformers/all-mpnet-base-v2`

	### Multilabel Classifier
	- DistilBERT fine-tuned for multilabel issue classification
	- Hosted on Hugging Face Hub

	### Dataset
	- Custom dataset of GitHub issues from the scikit-learn repository
	- Pre-computed embeddings stored on Hugging Face Hub

	---

	## 🧪 Example Use Cases

	- GitHub issue triage
	- Bug deduplication
	- Support ticket analysis
	- Internal engineering knowledge search
	- Maintainer productivity tools

	---

	## 🛠 Tech Stack

	- Python
	- Hugging Face Datasets & Transformers
	- Sentence Transformers
	- FAISS
	- Gradio (UI)
	- Hugging Face Spaces

	---

	## ✨ What This Project Demonstrates

	- End-to-end ML system design (not just a model)
	- Semantic search at scale
	- Multilabel NLP in a real-world setting
	- Hybrid ranking strategies
	- Practical UX considerations for ML products

	---

	## 📌 Notes

	This project goes beyond tutorial-level demos by focusing on:
	- Real datasets
	- Production constraints
	- Explainable ranking behavior
	- Clean, user-facing presentation

	---

	## 🙌 Acknowledgements

	Inspired by real GitHub workflows and the Hugging Face ecosystem.