| license: mit | |
| title: GitHub Issue Hybrid Search & Auto-Label Assistant | |
| sdk: gradio | |
| colorFrom: blue | |
| colorTo: green | |
| pinned: true | |
| thumbnail: >- | |
| https://cdn-uploads.huggingface.co/production/uploads/68281bec37b367f53ec121da/jbr9Hs4azEMPGbTZ2dYIz.png | |
| short_description: Hybrid semantic search and auto-labeling for GitHub issues | |
| sdk_version: 6.2.0 | |
| # GitHub Issue Hybrid Search & Auto-Label Assistant | |
| **Precision-first hybrid ranking for real GitHub issues using Semantic Search + Multilabel Classification.** | |
| This project demonstrates a **production-oriented hybrid retrieval system** that helps with **GitHub issue triage** by combining: | |
| - Dense semantic search (MPNet embeddings + FAISS) | |
| - Multilabel text classification (DistilBERT) | |
| - A hybrid ranking strategy that fuses semantic similarity and label consistency | |
| A live, interactive demo is available via **Hugging Face Spaces**. | |
| --- | |
| ## Live Demo | |
| **Hugging Face Space:** | |
| *GitHub Issue Hybrid Search & Auto-Label Assistant* | |
| Users can describe a GitHub issue in natural language and instantly: | |
| - See predicted issue labels (e.g. `Bug`, `Needs Triage`) | |
| - Retrieve the most relevant existing GitHub issues | |
| - Inspect semantic similarity, label overlap, and final hybrid scores | |
| - Open the original GitHub issues directly | |
| --- | |
| ## Problem Motivation | |
| Large open-source repositories receive **thousands of issues**, making it hard to: | |
| - Find similar historical issues | |
| - Detect duplicates | |
| - Assign correct labels early | |
| - Route issues to the right maintainers | |
| Keyword search alone is often insufficient, because related issues frequently describe the same problem in different words. | |
| This project addresses that gap with **semantic + label-aware retrieval**. | |
| --- | |
| ## System Overview | |
| The pipeline consists of four main stages: | |
| ### 1. Semantic Encoding | |
| User queries are encoded using dense sentence embeddings: | |
| - **Model:** `sentence-transformers/all-mpnet-base-v2` | |
| Issue texts in the dataset are pre-embedded and stored for fast retrieval. | |
| --- | |
| ### 2. Semantic Retrieval | |
| - **Index:** FAISS (runtime-built) | |
| - Retrieves the nearest issues based on vector similarity | |
| - Optimized for dense semantic matching rather than keywords | |
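The retrieval step can be illustrated without FAISS. The sketch below is a minimal, pure-Python stand-in, not the project's actual code: it ranks toy vectors by cosine similarity, which is the ranking an exact inner-product FAISS index would produce over unit-normalized embeddings. The function names `normalize` and `top_k` and the 3-dimensional vectors are assumptions made for illustration.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def top_k(query_vec, issue_vecs, k=3):
    """Return (index, similarity) pairs for the k most similar issue vectors."""
    q = normalize(query_vec)
    scored = []
    for idx, vec in enumerate(issue_vecs):
        v = normalize(vec)
        scored.append((idx, sum(a * b for a, b in zip(q, v))))
    # Highest similarity first
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 3-dimensional "embeddings" (hypothetical); real MPNet vectors have 768 dimensions.
issue_vecs = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.3, 0.1],
]
results = top_k([1.0, 0.0, 0.0], issue_vecs, k=2)
print(results)
```

A brute-force loop like this is only practical for small collections; a vector index exists to make the same lookup fast over many issues.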
| --- | |
| ### 3. Multilabel Classification | |
| A fine-tuned DistilBERT model predicts issue labels: | |
| - Example labels: `Bug`, `Needs Triage`, `module:ensemble`, `module:tree` | |
| - Multiple labels can be assigned per issue | |
| - Outputs confidence-based label predictions | |
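Multilabel prediction is commonly implemented by applying a sigmoid to each label's logit independently and keeping every label above a cutoff. The sketch below shows that idea on hand-written logits. The 0.5 threshold, the logit values, and the helper name `predict_labels` are assumptions, not details taken from the deployed classifier.

```python
import math

# Example label names taken from the list above; the real label set may be larger.
LABELS = ["Bug", "Needs Triage", "module:ensemble", "module:tree"]

def sigmoid(x):
    """Squash a raw logit into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def predict_labels(logits, labels=LABELS, threshold=0.5):
    """Map one logit per label to {label: probability} for labels at or above threshold."""
    probs = {label: sigmoid(z) for label, z in zip(labels, logits)}
    return {label: p for label, p in probs.items() if p >= threshold}

# Hypothetical logits for a single issue, one per label.
predicted = predict_labels([2.0, 0.3, -1.5, -3.0])
print(predicted)
```

Because each label is scored independently, an issue can receive several labels at once, or none if every probability falls below the threshold.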
| --- | |
| ### 4. Hybrid Ranking (Key Contribution) | |
| Semantic similarity alone is not always enough. | |
| This system uses a **hybrid scoring function**: | |
| final_score = α · semantic_similarity + β · label_overlap | |
| - **Semantic similarity:** how close the issue texts are in embedding space | |
| - **Label overlap:** how well predicted labels match existing issue labels | |
| - **α / β:** tunable weights (precision-first by design) | |
| Issues are also **deduplicated by issue number** to avoid repeated results. | |
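The scoring and deduplication described above can be sketched as follows. This is an illustrative reading of the formula, not the project's implementation: the README does not define how label overlap is computed, so Jaccard overlap is assumed here, and the weights `alpha=0.7` and `beta=0.3`, the candidate records, and the function names are all hypothetical.

```python
def label_overlap(predicted, existing):
    """Jaccard overlap between two label sets (an assumed definition)."""
    predicted, existing = set(predicted), set(existing)
    union = predicted | existing
    return len(predicted & existing) / len(union) if union else 0.0

def hybrid_rank(candidates, predicted_labels, alpha=0.7, beta=0.3):
    """Score candidates, keep the best hit per issue number, sort descending."""
    best = {}
    for cand in candidates:
        overlap = label_overlap(predicted_labels, cand["labels"])
        score = alpha * cand["similarity"] + beta * overlap
        scored = {**cand, "label_overlap": overlap, "final_score": score}
        current = best.get(cand["number"])
        # Deduplicate: only the highest-scoring hit per issue number survives
        if current is None or score > current["final_score"]:
            best[cand["number"]] = scored
    return sorted(best.values(), key=lambda c: c["final_score"], reverse=True)

# Hypothetical retrieval hits; issue 101 appears twice.
candidates = [
    {"number": 101, "similarity": 0.80, "labels": ["Bug"]},
    {"number": 102, "similarity": 0.85, "labels": ["module:tree"]},
    {"number": 101, "similarity": 0.60, "labels": ["Bug"]},
]
ranked = hybrid_rank(candidates, predicted_labels=["Bug", "Needs Triage"])
for item in ranked:
    print(item["number"], round(item["final_score"], 3))
```

In this toy run, issue 102 has the higher semantic similarity but shares no labels with the prediction, so issue 101 ranks first once label overlap is added.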
| --- | |
| ## Design Principles | |
| ### Precision-First Retrieval | |
| - The system may return fewer than *k* results intentionally | |
| - It avoids padding the results with weakly related issues | |
| - Returning **1 to 4 highly relevant issues** is considered a success | |
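One simple way to get this behavior is a minimum-score cutoff applied before truncating to *k*. The sketch below assumes the input is already sorted by descending score. The cutoff value `min_score=0.6`, the function name `precision_first`, and the sample records are assumptions for illustration, not values from the project.

```python
def precision_first(ranked, k=5, min_score=0.6):
    """Return at most k results, dropping any whose score is below min_score."""
    kept = [item for item in ranked if item["final_score"] >= min_score]
    return kept[:k]

# Hypothetical ranked results, already sorted by descending final score.
ranked = [
    {"number": 101, "final_score": 0.71},
    {"number": 102, "final_score": 0.64},
    {"number": 103, "final_score": 0.41},
    {"number": 104, "final_score": 0.22},
]
results = precision_first(ranked, k=5, min_score=0.6)
print([item["number"] for item in results])
```

Here five results were requested but only two clear the cutoff, so only two are returned.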
| ### Runtime FAISS Indexing | |
| - FAISS indices are created at runtime | |
| - Keeps datasets lightweight and portable on Hugging Face Hub | |
| ### Transparency | |
| - Scores are explicitly shown: | |
|   - Semantic similarity | |
|   - Label overlap | |
|   - Final hybrid score | |
| - GitHub URLs are fully visible and clickable | |
| --- | |
| ## Models & Data | |
| ### Embedding Model | |
| - `sentence-transformers/all-mpnet-base-v2` | |
| ### Multilabel Classifier | |
| - DistilBERT fine-tuned for multilabel issue classification | |
| - Hosted on Hugging Face Hub | |
| ### Dataset | |
| - Custom dataset of GitHub issues from the scikit-learn repository | |
| - Pre-computed embeddings stored on Hugging Face Hub | |
| --- | |
| ## Example Use Cases | |
| - GitHub issue triage | |
| - Bug deduplication | |
| - Support ticket analysis | |
| - Internal engineering knowledge search | |
| - Maintainer productivity tools | |
| --- | |
| ## Tech Stack | |
| - Python | |
| - Hugging Face Datasets & Transformers | |
| - Sentence Transformers | |
| - FAISS | |
| - Gradio (UI) | |
| - Hugging Face Spaces | |
| --- | |
| ## What This Project Demonstrates | |
| - End-to-end ML system design (not just a model) | |
| - Semantic search at scale | |
| - Multilabel NLP in a real-world setting | |
| - Hybrid ranking strategies | |
| - Practical UX considerations for ML products | |
| --- | |
| ## Notes | |
| This project goes **beyond tutorial-level demos** by focusing on: | |
| - Real datasets | |
| - Production constraints | |
| - Explainable ranking behavior | |
| - Clean, user-facing presentation | |
| --- | |
| ## Acknowledgements | |
| Inspired by real GitHub workflows and the Hugging Face ecosystem. |