Talip7/github-issue-hybrid-search

Talip7 committed on Jan 5

Commit 3fdde58 (verified) · 1 Parent(s): b320624

Create README.md

Files changed (1)

README.md +166 -10

@@ -1,14 +1,170 @@
 ---
-title: Github Issue Hybrid Search
-emoji: 🐢
-colorFrom: blue
-colorTo: purple
-sdk: gradio
-sdk_version: 6.2.0
-app_file: app.py
-pinned: false
 license: mit
-short_description: Hybrid semantic search and multilabel classification
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

 ---
 license: mit
+title: 🤗 GitHub Issue Hybrid Search & Auto-Label Assistant
+sdk: gradio
+colorFrom: blue
+colorTo: green
+pinned: true
+thumbnail: >-
+  https://cdn-uploads.huggingface.co/production/uploads/68281bec37b367f53ec121da/jbr9Hs4azEMPGbTZ2dYIz.png
+short_description: Hybrid semantic search and auto-labeling for GitHub issues
+---
+# 🤗 GitHub Issue Hybrid Search & Auto-Label Assistant
+**Precision-first hybrid ranking for real GitHub issues using Semantic Search + Multilabel Classification.**
+This project demonstrates a **production-oriented hybrid retrieval system** that helps with **GitHub issue triage** by combining:
+- Dense semantic search (MPNet embeddings + FAISS)
+- Multilabel text classification (DistilBERT)
+- A hybrid ranking strategy that fuses semantic similarity and label consistency
+A live, interactive demo is available via **Hugging Face Spaces**.
+---
+## 🚀 Live Demo
+🔗 **Hugging Face Space:**
+*GitHub Issue Hybrid Search & Auto-Label Assistant*
+Users can describe a GitHub issue in natural language and instantly:
+- See predicted issue labels (e.g. `Bug`, `Needs Triage`)
+- Retrieve the most relevant existing GitHub issues
+- Inspect semantic similarity, label overlap, and final hybrid scores
+- Open the original GitHub issues directly
+---
+## 🔍 Problem Motivation
+Large open-source repositories receive **thousands of issues**, making it hard to:
+- Find similar historical issues
+- Detect duplicates
+- Assign correct labels early
+- Route issues to the right maintainers
+Keyword search alone often misses issues that describe the same problem in different words.
+This project addresses that gap with **semantic + label-aware retrieval**.
+---
+## 🧠 System Overview
+The pipeline consists of four main stages:
+### 1. Semantic Encoding
+User queries are encoded using dense sentence embeddings:
+- **Model:** `sentence-transformers/all-mpnet-base-v2`
+Issue texts in the dataset are pre-embedded and stored for fast retrieval.
+---
+### 2. Semantic Retrieval
+- **Index:** FAISS (runtime-built)
+- Retrieves the nearest issues based on vector similarity
+- Optimized for dense semantic matching rather than keywords
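The retrieval step above can be sketched without any dependencies. This is a minimal illustration, not the Space's actual code: it uses brute-force cosine similarity in place of a FAISS index, and the tiny 2-dimensional vectors stand in for real MPNet embeddings. The helper names `normalize` and `top_k` are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(query, corpus, k=3):
    """Return (index, cosine_similarity) for the k corpus vectors closest to query.

    Brute-force scan over every vector; a FAISS index would replace this loop
    in a real deployment.
    """
    q = normalize(query)
    scored = []
    for idx, vec in enumerate(corpus):
        v = normalize(vec)
        scored.append((sum(a * b for a, b in zip(q, v)), idx))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]


# Toy "embeddings" (assumed data, for illustration only).
corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = top_k([1.0, 0.1], corpus, k=2)
print(hits)
```

Normalizing both sides means an inner-product search ranks by cosine similarity, which is a common way to use dense sentence embeddings.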
+---
+### 3. Multilabel Classification
+A fine-tuned DistilBERT model predicts issue labels:
+- Example labels: `Bug`, `Needs Triage`, `module:ensemble`, `module:tree`
+- Multiple labels can be assigned per issue
+- Outputs a confidence score for each predicted label
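A common way to turn classifier outputs into multiple labels is to apply a sigmoid to each logit independently and keep every label above a threshold. The sketch below assumes that approach; the README does not state the threshold or decoding rule actually used, so `threshold=0.5`, the logits, and the function names are illustrative assumptions.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def predict_labels(logits, label_names, threshold=0.5):
    """Return (label, confidence) pairs whose confidence meets the threshold.

    Each label is scored independently, so zero, one, or several labels
    can be returned for the same issue.
    """
    preds = []
    for name, logit in zip(label_names, logits):
        confidence = sigmoid(logit)
        if confidence >= threshold:
            preds.append((name, confidence))
    preds.sort(key=lambda pair: pair[1], reverse=True)
    return preds


# Assumed logits for three of the example labels.
labels = ["Bug", "Needs Triage", "module:tree"]
print(predict_labels([2.0, 0.3, -1.5], labels))
```

Unlike a softmax, the per-label sigmoid does not force the confidences to sum to one, which is what allows several labels per issue.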
+---
+### 4. Hybrid Ranking (Key Contribution)
+Semantic similarity alone is not always enough.
+This system uses a **hybrid scoring function**:
+`final_score = α · semantic_similarity + β · label_overlap`
+- **Semantic similarity:** how close the issue texts are in embedding space
+- **Label overlap:** how well predicted labels match existing issue labels
+- **α / β:** tunable weights (precision-first by design)
+Issues are also **deduplicated by issue number** to avoid repeated results.
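The scoring and deduplication described above can be sketched as follows. Several details are assumptions because the README does not specify them: label overlap is computed here as Jaccard similarity, the weights `alpha=0.7` and `beta=0.3` are placeholders, and the candidate dictionaries (including the `Documentation` label) are made-up sample data.

```python
def label_overlap(predicted, existing):
    """Jaccard overlap between predicted and existing label sets (an assumed definition)."""
    p, e = set(predicted), set(existing)
    union = p | e
    return len(p & e) / len(union) if union else 0.0


def hybrid_rank(candidates, predicted_labels, alpha=0.7, beta=0.3):
    """Score candidates with alpha * similarity + beta * overlap.

    Keeps only the best-scoring row per issue number, then sorts by final score.
    """
    best = {}
    for cand in candidates:
        overlap = label_overlap(predicted_labels, cand["labels"])
        final = alpha * cand["similarity"] + beta * overlap
        row = {**cand, "label_overlap": overlap, "final_score": final}
        prev = best.get(cand["number"])
        if prev is None or final > prev["final_score"]:
            best[cand["number"]] = row
    return sorted(best.values(), key=lambda r: r["final_score"], reverse=True)


# Hypothetical retrieval output: issue 101 appears twice and is deduplicated.
candidates = [
    {"number": 101, "similarity": 0.80, "labels": ["Bug"]},
    {"number": 101, "similarity": 0.75, "labels": ["Bug"]},
    {"number": 202, "similarity": 0.85, "labels": ["Documentation"]},
]
for row in hybrid_rank(candidates, ["Bug", "Needs Triage"]):
    print(row["number"], round(row["final_score"], 3))
```

In this toy data, issue 202 has the highest semantic similarity but shares no labels with the prediction, so issue 101 ranks first once label overlap is added.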
 ---
+## 🎯 Design Principles
+### Precision-First Retrieval
+- The system may return fewer than *k* results intentionally
+- It avoids padding the results with weakly related issues
+- Returning **1–4 highly relevant issues** is considered a success
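One simple way to get this behavior is a minimum-score cutoff applied before truncating to *k*. The sketch below assumes that mechanism; the README does not say how the Space decides what to drop, and `min_score=0.6` and the sample rows are illustrative values only.

```python
def precision_first(ranked, k=5, min_score=0.6):
    """Keep at most k results, dropping any whose final score is below min_score.

    May return fewer than k rows (or none) when few candidates are strong matches.
    """
    ordered = sorted(ranked, key=lambda r: r["final_score"], reverse=True)
    kept = [r for r in ordered if r["final_score"] >= min_score]
    return kept[:k]


# Hypothetical ranked results: only two clear the cutoff even though k is 5.
ranked = [
    {"number": 101, "final_score": 0.82},
    {"number": 202, "final_score": 0.64},
    {"number": 303, "final_score": 0.41},
]
print([r["number"] for r in precision_first(ranked, k=5)])
```

Raising `min_score` trades recall for precision, which matches the stated preference for a short list of highly relevant issues.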
+### Runtime FAISS Indexing
+- FAISS indices are created at runtime
+- Keeps datasets lightweight and portable on Hugging Face Hub
+### Transparency
+- Scores are explicitly shown:
+  - Semantic similarity
+  - Label overlap
+  - Final hybrid score
+- GitHub URLs are fully visible and clickable
+---
+## 📦 Models & Data
+### Embedding Model
+- `sentence-transformers/all-mpnet-base-v2`
+### Multilabel Classifier
+- DistilBERT fine-tuned for multilabel issue classification
+- Hosted on Hugging Face Hub
+### Dataset
+- Custom dataset of GitHub issues from the scikit-learn repository
+- Pre-computed embeddings stored on Hugging Face Hub
+---
+## 🧪 Example Use Cases
+- GitHub issue triage
+- Bug deduplication
+- Support ticket analysis
+- Internal engineering knowledge search
+- Maintainer productivity tools
+---
+## 🛠 Tech Stack
+- Python
+- Hugging Face Datasets & Transformers
+- Sentence Transformers
+- FAISS
+- Gradio (UI)
+- Hugging Face Spaces
+---
+## ✨ What This Project Demonstrates
+- End-to-end ML system design (not just a model)
+- Semantic search at scale
+- Multilabel NLP in a real-world setting
+- Hybrid ranking strategies
+- Practical UX considerations for ML products
+---
+## 📌 Notes
+This project goes **beyond tutorial-level demos** by focusing on:
+- Real datasets
+- Production constraints
+- Explainable ranking behavior
+- Clean, user-facing presentation
+---
+## 🙌 Acknowledgements
+Inspired by real GitHub workflows and the Hugging Face ecosystem.