---
license: mit
title: 🤗 GitHub Issue Hybrid Search & Auto-Label Assistant
sdk: gradio
colorFrom: blue
colorTo: green
pinned: true
thumbnail: >-
https://cdn-uploads.huggingface.co/production/uploads/68281bec37b367f53ec121da/jbr9Hs4azEMPGbTZ2dYIz.png
short_description: Hybrid semantic search and auto-labeling for GitHub issues
sdk_version: 6.2.0
---
# 🤗 GitHub Issue Hybrid Search & Auto-Label Assistant
**Precision-first hybrid ranking for real GitHub issues using Semantic Search + Multilabel Classification.**
This project demonstrates a **production-oriented hybrid retrieval system** that helps with **GitHub issue triage** by combining:
- Dense semantic search (MPNet embeddings + FAISS)
- Multilabel text classification (DistilBERT)
- A hybrid ranking strategy that fuses semantic similarity and label consistency
A live, interactive demo is available via **Hugging Face Spaces**.
---
## 🚀 Live Demo
🔗 **Hugging Face Space:**
*GitHub Issue Hybrid Search & Auto-Label Assistant*
Users can describe a GitHub issue in natural language and instantly:
- See predicted issue labels (e.g. `Bug`, `Needs Triage`)
- Retrieve the most relevant existing GitHub issues
- Inspect semantic similarity, label overlap, and final hybrid scores
- Open the original GitHub issues directly
---
## 🔍 Problem Motivation
Large open-source repositories receive **thousands of issues**, making it hard to:
- Find similar historical issues
- Detect duplicates
- Assign correct labels early
- Route issues to the right maintainers
Keyword search alone is often insufficient.
This project addresses that gap with **semantic + label-aware retrieval**.
---
## 🧠 System Overview
The pipeline consists of four main stages:
### 1. Semantic Encoding
User queries are encoded using dense sentence embeddings:
- **Model:** `sentence-transformers/all-mpnet-base-v2`
Issue texts in the dataset are pre-embedded and stored for fast retrieval.
---
### 2. Semantic Retrieval
- **Index:** FAISS (runtime-built)
- Retrieves the nearest issues based on vector similarity
- Optimized for dense semantic matching rather than keywords
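
To make the retrieval step concrete, here is a minimal pure-Python sketch of the computation a flat (exhaustive) vector index performs: normalize the vectors, take dot products, keep the top `k`. It is a brute-force stand-in rather than FAISS itself, and the function names and toy 3-dimensional vectors are illustrative assumptions, not code from this project. Real `all-mpnet-base-v2` embeddings have 768 dimensions.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def search(query, corpus, k=3):
    """Return (index, cosine similarity) pairs for the k closest corpus vectors."""
    q = normalize(query)
    scored = []
    for i, vec in enumerate(corpus):
        v = normalize(vec)
        scored.append((i, sum(a * b for a, b in zip(q, v))))
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" (hypothetical values, for illustration only).
corpus = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
hits = search([1.0, 0.0, 0.0], corpus, k=2)
for index, similarity in hits:
    print(index, round(similarity, 4))
```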
---
### 3. Multilabel Classification
A fine-tuned DistilBERT model predicts issue labels:
- Example labels: `Bug`, `Needs Triage`, `module:ensemble`, `module:tree`
- Multiple labels can be assigned per issue
- Outputs confidence-based label predictions
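
Multilabel prediction differs from single-label classification in that each label is scored independently. The sketch below shows one common way to turn raw classifier outputs (logits) into a set of labels: apply a sigmoid to each logit and keep every label above a threshold. The `0.5` threshold, the logit values, and the function names are assumptions for illustration; the project's actual decision rule is not specified here.

```python
import math


def sigmoid(x):
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(logits, label_names, threshold=0.5):
    """Keep every label whose independent sigmoid probability clears the threshold."""
    probs = {name: sigmoid(z) for name, z in zip(label_names, logits)}
    return {name: p for name, p in probs.items() if p >= threshold}


label_names = ["Bug", "Needs Triage", "module:ensemble", "module:tree"]
logits = [2.1, 0.3, -1.7, -3.0]  # made-up values, not real model output
predicted = predict_labels(logits, label_names)
for name, confidence in predicted.items():
    print(f"{name}: {confidence:.2f}")
```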
---
### 4. Hybrid Ranking (Key Contribution)
Semantic similarity alone is not always enough.
This system uses a **hybrid scoring function**:
`final_score = α · semantic_similarity + β · label_overlap`
- **Semantic similarity:** how close the issue texts are in embedding space
- **Label overlap:** how well predicted labels match existing issue labels
- **α / β:** tunable weights (precision-first by design)
Issues are also **deduplicated by issue number** to avoid repeated results.
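
The scoring and deduplication steps can be sketched as follows. This is a hedged illustration, not the project's implementation: label overlap is assumed to be the Jaccard index of the two label sets, the weights `alpha=0.7` and `beta=0.3` are placeholder values, and the candidate records (including the `Documentation` label) are invented. In this toy data, issue 202 has the higher semantic similarity, but issue 101 ranks first once label overlap is taken into account.

```python
def label_overlap(predicted, existing):
    """Jaccard overlap between two label sets (an assumed definition)."""
    predicted, existing = set(predicted), set(existing)
    union = predicted | existing
    return len(predicted & existing) / len(union) if union else 0.0


def hybrid_rank(candidates, predicted_labels, alpha=0.7, beta=0.3):
    """Score candidates, keep the best hit per issue number, sort by final score."""
    best = {}
    for c in candidates:
        overlap = label_overlap(predicted_labels, c["labels"])
        final = alpha * c["similarity"] + beta * overlap
        scored = {**c, "label_overlap": overlap, "final_score": final}
        prev = best.get(c["number"])
        # Deduplicate: only the highest-scoring hit per issue number survives.
        if prev is None or final > prev["final_score"]:
            best[c["number"]] = scored
    return sorted(best.values(), key=lambda r: r["final_score"], reverse=True)


candidates = [
    {"number": 101, "similarity": 0.82, "labels": ["Bug", "module:tree"]},
    {"number": 101, "similarity": 0.78, "labels": ["Bug", "module:tree"]},  # repeated hit
    {"number": 202, "similarity": 0.85, "labels": ["Documentation"]},
]
ranked = hybrid_rank(candidates, predicted_labels=["Bug", "Needs Triage"])
for r in ranked:
    print(r["number"], round(r["final_score"], 3))
```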
---
## 🎯 Design Principles
### Precision-First Retrieval
- The system may return fewer than *k* results intentionally
- It drops weakly related issues instead of padding the result list
- Returning **1–4 highly relevant issues** is considered a success
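
One simple way to get this behavior is a score cutoff applied before truncating to *k*. The sketch below assumes results are already sorted by final score in descending order; the `min_score=0.6` cutoff, the function name, and the sample scores are hypothetical and would need tuning on real data.

```python
def precision_first(ranked, k=5, min_score=0.6):
    """Return at most k results from a descending-sorted list, dropping low scores."""
    kept = [r for r in ranked if r["final_score"] >= min_score]
    return kept[:k]


ranked = [
    {"number": 101, "final_score": 0.91},
    {"number": 202, "final_score": 0.74},
    {"number": 303, "final_score": 0.41},
    {"number": 404, "final_score": 0.38},
]
results = precision_first(ranked, k=5)
print([r["number"] for r in results])
```

Here only two of the four candidates clear the cutoff, so the caller receives two results even though `k=5` was requested.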
### Runtime FAISS Indexing
- FAISS indices are created at runtime
- Keeps datasets lightweight and portable on Hugging Face Hub
### Transparency
- Scores are explicitly shown:
  - Semantic similarity
  - Label overlap
  - Final hybrid score
- GitHub URLs are fully visible and clickable
---
## 📦 Models & Data
### Embedding Model
- `sentence-transformers/all-mpnet-base-v2`
### Multilabel Classifier
- DistilBERT fine-tuned for multilabel issue classification
- Hosted on Hugging Face Hub
### Dataset
- Custom dataset of GitHub issues from the scikit-learn repository
- Pre-computed embeddings stored on Hugging Face Hub
---
## 🧪 Example Use Cases
- GitHub issue triage
- Bug deduplication
- Support ticket analysis
- Internal engineering knowledge search
- Maintainer productivity tools
---
## 🛠 Tech Stack
- Python
- Hugging Face Datasets & Transformers
- Sentence Transformers
- FAISS
- Gradio (UI)
- Hugging Face Spaces
---
## ✨ What This Project Demonstrates
- End-to-end ML system design (not just a model)
- Semantic search at scale
- Multilabel NLP in a real-world setting
- Hybrid ranking strategies
- Practical UX considerations for ML products
---
## 📌 Notes
This project goes **beyond tutorial-level demos** by focusing on:
- Real datasets
- Production constraints
- Explainable ranking behavior
- Clean, user-facing presentation
---
## 🙌 Acknowledgements
Inspired by real GitHub workflows and the Hugging Face ecosystem.