Create README.md
README.md
CHANGED
|
@@ -1,14 +1,170 @@
|
|
| 1 |
---
|
| 2 |
-
title: Github Issue Hybrid Search
|
| 3 |
-
emoji: π’
|
| 4 |
-
colorFrom: blue
|
| 5 |
-
colorTo: purple
|
| 6 |
-
sdk: gradio
|
| 7 |
-
sdk_version: 6.2.0
|
| 8 |
-
app_file: app.py
|
| 9 |
-
pinned: false
|
| 10 |
license: mit
|
| 11 |
-
|
| 12 |
---
|
| 13 |
|
| 14 |
-
|
| 1 |
---
|
| 2 |
license: mit
|
| 3 |
+
title: GitHub Issue Hybrid Search & Auto-Label Assistant
|
| 4 |
+
sdk: gradio
|
| 5 |
+
colorFrom: blue
|
| 6 |
+
colorTo: green
|
| 7 |
+
pinned: true
|
| 8 |
+
thumbnail: >-
|
| 9 |
+
https://cdn-uploads.huggingface.co/production/uploads/68281bec37b367f53ec121da/jbr9Hs4azEMPGbTZ2dYIz.png
|
| 10 |
+
short_description: Hybrid semantic search and auto-labeling for GitHub issues
|
| 11 |
+
---
|
| 12 |
+
|
| 13 |
+
# GitHub Issue Hybrid Search & Auto-Label Assistant
|
| 14 |
+
|
| 15 |
+
**Precision-first hybrid ranking for real GitHub issues using Semantic Search + Multilabel Classification.**
|
| 16 |
+
|
| 17 |
+
This project demonstrates a **production-oriented hybrid retrieval system** that helps with **GitHub issue triage** by combining:
|
| 18 |
+
- Dense semantic search (MPNet embeddings + FAISS)
|
| 19 |
+
- Multilabel text classification (DistilBERT)
|
| 20 |
+
- A hybrid ranking strategy that fuses semantic similarity and label consistency
|
| 21 |
+
|
| 22 |
+
A live, interactive demo is available via **Hugging Face Spaces**.
|
| 23 |
+
|
| 24 |
+
---
|
| 25 |
+
|
| 26 |
+
## Live Demo
|
| 27 |
+
|
| 28 |
+
**Hugging Face Space:**
|
| 29 |
+
*GitHub Issue Hybrid Search & Auto-Label Assistant*
|
| 30 |
+
|
| 31 |
+
Users can describe a GitHub issue in natural language and instantly:
|
| 32 |
+
- See predicted issue labels (e.g. `Bug`, `Needs Triage`)
|
| 33 |
+
- Retrieve the most relevant existing GitHub issues
|
| 34 |
+
- Inspect semantic similarity, label overlap, and final hybrid scores
|
| 35 |
+
- Open the original GitHub issues directly
|
| 36 |
+
|
| 37 |
+
---
|
| 38 |
+
|
| 39 |
+
## Problem Motivation
|
| 40 |
+
|
| 41 |
+
Large open-source repositories receive **thousands of issues**, making it hard to:
|
| 42 |
+
- Find similar historical issues
|
| 43 |
+
- Detect duplicates
|
| 44 |
+
- Assign correct labels early
|
| 45 |
+
- Route issues to the right maintainers
|
| 46 |
+
|
| 47 |
+
Keyword search alone is often insufficient.
|
| 48 |
+
This project addresses that gap with **semantic + label-aware retrieval**.
|
| 49 |
+
|
| 50 |
+
---
|
| 51 |
+
|
| 52 |
+
## 🧠 System Overview
|
| 53 |
+
|
| 54 |
+
The pipeline consists of four main stages:
|
| 55 |
+
|
| 56 |
+
### 1. Semantic Encoding
|
| 57 |
+
User queries are encoded using dense sentence embeddings:
|
| 58 |
+
- **Model:** `sentence-transformers/all-mpnet-base-v2`
|
| 59 |
+
|
| 60 |
+
Issue texts in the dataset are pre-embedded and stored for fast retrieval.
|
| 61 |
+
|
| 62 |
+
---
|
| 63 |
+
|
| 64 |
+
### 2. Semantic Retrieval
|
| 65 |
+
- **Index:** FAISS (runtime-built)
|
| 66 |
+
- Retrieves the nearest issues based on vector similarity
|
| 67 |
+
- Optimized for dense semantic matching rather than keywords
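As a rough illustration of what this stage computes, the sketch below does a brute-force cosine-similarity search in plain Python. It is a stand-in, not the project's code: the real system uses a FAISS index over MPNet embeddings, and the function names `cosine` and `top_k` and the toy 3-dimensional vectors are assumptions made for this example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, issue_vecs, k=3):
    """Return (index, similarity) pairs for the k most similar issue vectors."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(issue_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]

# Toy embeddings (hypothetical); real MPNet embeddings have 768 dimensions.
issue_vecs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
hits = top_k([1.0, 0.0, 0.0], issue_vecs, k=2)
```

A flat inner-product index over L2-normalized vectors (for example FAISS `IndexFlatIP`) yields the same ranking as this loop, only much faster on large collections.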
|
| 68 |
+
|
| 69 |
+
---
|
| 70 |
+
|
| 71 |
+
### 3. Multilabel Classification
|
| 72 |
+
A fine-tuned DistilBERT model predicts issue labels:
|
| 73 |
+
- Example labels: `Bug`, `Needs Triage`, `module:ensemble`, `module:tree`
|
| 74 |
+
- Multiple labels can be assigned per issue
|
| 75 |
+
- Outputs confidence-based label predictions
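The multilabel step can be sketched as an independent sigmoid per label followed by a threshold. This is a minimal illustration under stated assumptions: the logits are made up, the 0.5 threshold and the `predict_labels` helper are hypothetical, and the actual classifier head and label set may differ.

```python
import math

# Label names taken from the examples above; the real label set may be larger.
LABELS = ["Bug", "Needs Triage", "module:ensemble", "module:tree"]

def sigmoid(x):
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def predict_labels(logits, threshold=0.5):
    """Keep every label whose independent probability clears the threshold."""
    probs = [sigmoid(z) for z in logits]
    return [(label, round(p, 3)) for label, p in zip(LABELS, probs) if p >= threshold]

# Hypothetical logits for one issue, one value per label.
predicted = predict_labels([2.0, 0.3, -1.5, -3.0])
```

Because each label is scored independently, an issue can receive zero, one, or several labels, unlike a softmax classifier that always picks exactly one.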
|
| 76 |
+
|
| 77 |
+
---
|
| 78 |
+
|
| 79 |
+
### 4. Hybrid Ranking (Key Contribution)
|
| 80 |
+
Semantic similarity alone is not always enough.
|
| 81 |
+
This system uses a **hybrid scoring function**:
|
| 82 |
+
final_score = α · semantic_similarity + β · label_overlap
|
| 83 |
+
|
| 84 |
+
- **Semantic similarity:** how close the issue texts are in embedding space
|
| 85 |
+
- **Label overlap:** how well predicted labels match existing issue labels
|
| 86 |
+
- **α / β:** tunable weights (precision-first by design)
|
| 87 |
+
|
| 88 |
+
Issues are also **deduplicated by issue number** to avoid repeated results.
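The scoring and deduplication described above might look like the following sketch. Several details are assumptions rather than the project's actual choices: label overlap is computed as Jaccard similarity, the weights default to 0.7 and 0.3, and the candidate records (including the `Documentation` label) are invented for illustration.

```python
def label_overlap(predicted, existing):
    """Jaccard overlap between predicted and existing label sets (an assumed metric)."""
    predicted, existing = set(predicted), set(existing)
    union = predicted | existing
    return len(predicted & existing) / len(union) if union else 0.0

def hybrid_rank(candidates, predicted_labels, alpha=0.7, beta=0.3):
    """Score each candidate, keep the best entry per issue number, sort descending."""
    best = {}
    for cand in candidates:
        overlap = label_overlap(predicted_labels, cand["labels"])
        score = alpha * cand["similarity"] + beta * overlap
        num = cand["number"]
        # Deduplicate: retain only the highest-scoring entry for each issue.
        if num not in best or score > best[num]["final_score"]:
            best[num] = {**cand, "final_score": score}
    return sorted(best.values(), key=lambda c: c["final_score"], reverse=True)

# Hypothetical candidates; issue 101 appears twice, e.g. from two text chunks.
candidates = [
    {"number": 101, "similarity": 0.80, "labels": ["Bug"]},
    {"number": 101, "similarity": 0.75, "labels": ["Bug"]},
    {"number": 202, "similarity": 0.85, "labels": ["Documentation"]},
]
ranked = hybrid_rank(candidates, ["Bug", "Needs Triage"])
```

In this toy run, issue 202 has the higher raw similarity, but issue 101 ranks first because its labels agree with the prediction, which is the effect the hybrid score is meant to produce.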
|
| 89 |
+
|
| 90 |
---
|
| 91 |
|
| 92 |
+
## 🎯 Design Principles
|
| 93 |
+
|
| 94 |
+
### Precision-First Retrieval
|
| 95 |
+
- The system may return fewer than *k* results intentionally
|
| 96 |
+
- It avoids padding the results with weakly related issues
|
| 97 |
+
- Returning **1 to 4 highly relevant issues** is considered a success
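One simple way to get this behavior is a minimum-score cutoff applied before truncating to *k*. The sketch below assumes a `final_score` field on each result and a cutoff of 0.6; both the `precision_first` helper and the threshold value are hypothetical, not taken from the project.

```python
def precision_first(ranked, k=5, min_score=0.6):
    """Drop results below min_score, then return at most k of the remainder."""
    return [r for r in ranked if r["final_score"] >= min_score][:k]

# Hypothetical ranked results, already sorted by final score.
ranked = [
    {"number": 101, "final_score": 0.71},
    {"number": 202, "final_score": 0.62},
    {"number": 303, "final_score": 0.41},
]
results = precision_first(ranked, k=5)
```

Here only two of the three candidates survive even though *k* is 5, so the caller sees fewer but more relevant issues.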
|
| 98 |
+
|
| 99 |
+
### Runtime FAISS Indexing
|
| 100 |
+
- FAISS indices are created at runtime
|
| 101 |
+
- Keeps datasets lightweight and portable on Hugging Face Hub
|
| 102 |
+
|
| 103 |
+
### Transparency
|
| 104 |
+
- Scores are explicitly shown:
|
| 105 |
+
- Semantic similarity
|
| 106 |
+
- Label overlap
|
| 107 |
+
- Final hybrid score
|
| 108 |
+
- GitHub URLs are fully visible and clickable
|
| 109 |
+
|
| 110 |
+
---
|
| 111 |
+
|
| 112 |
+
## 📦 Models & Data
|
| 113 |
+
|
| 114 |
+
### Embedding Model
|
| 115 |
+
- `sentence-transformers/all-mpnet-base-v2`
|
| 116 |
+
|
| 117 |
+
### Multilabel Classifier
|
| 118 |
+
- DistilBERT fine-tuned for multilabel issue classification
|
| 119 |
+
- Hosted on Hugging Face Hub
|
| 120 |
+
|
| 121 |
+
### Dataset
|
| 122 |
+
- Custom dataset of GitHub issues from the scikit-learn repository
|
| 123 |
+
- Pre-computed embeddings stored on Hugging Face Hub
|
| 124 |
+
|
| 125 |
+
---
|
| 126 |
+
|
| 127 |
+
## 🧪 Example Use Cases
|
| 128 |
+
|
| 129 |
+
- GitHub issue triage
|
| 130 |
+
- Bug deduplication
|
| 131 |
+
- Support ticket analysis
|
| 132 |
+
- Internal engineering knowledge search
|
| 133 |
+
- Maintainer productivity tools
|
| 134 |
+
|
| 135 |
+
---
|
| 136 |
+
|
| 137 |
+
## Tech Stack
|
| 138 |
+
|
| 139 |
+
- Python
|
| 140 |
+
- Hugging Face Datasets & Transformers
|
| 141 |
+
- Sentence Transformers
|
| 142 |
+
- FAISS
|
| 143 |
+
- Gradio (UI)
|
| 144 |
+
- Hugging Face Spaces
|
| 145 |
+
|
| 146 |
+
---
|
| 147 |
+
|
| 148 |
+
## ✨ What This Project Demonstrates
|
| 149 |
+
|
| 150 |
+
- End-to-end ML system design (not just a model)
|
| 151 |
+
- Semantic search at scale
|
| 152 |
+
- Multilabel NLP in a real-world setting
|
| 153 |
+
- Hybrid ranking strategies
|
| 154 |
+
- Practical UX considerations for ML products
|
| 155 |
+
|
| 156 |
+
---
|
| 157 |
+
|
| 158 |
+
## Notes
|
| 159 |
+
|
| 160 |
+
This project goes **beyond tutorial-level demos** by focusing on:
|
| 161 |
+
- Real datasets
|
| 162 |
+
- Production constraints
|
| 163 |
+
- Explainable ranking behavior
|
| 164 |
+
- Clean, user-facing presentation
|
| 165 |
+
|
| 166 |
+
---
|
| 167 |
+
|
| 168 |
+
## Acknowledgements
|
| 169 |
+
|
| 170 |
+
Inspired by real GitHub workflows and the Hugging Face ecosystem.
|