Talip7 committed on
Commit
3fdde58
·
verified ·
1 Parent(s): b320624

Create README.md

Files changed (1)
  1. README.md +166 -10
README.md CHANGED
@@ -1,14 +1,170 @@
1
  ---
2
- title: Github Issue Hybrid Search
3
- emoji: 🐒
4
- colorFrom: blue
5
- colorTo: purple
6
- sdk: gradio
7
- sdk_version: 6.2.0
8
- app_file: app.py
9
- pinned: false
10
  license: mit
11
- short_description: Hybrid semantic search and multilabel classification
12
  ---
13
 
14
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
1
  ---
2
  license: mit
3
+ title: 🤗 GitHub Issue Hybrid Search & Auto-Label Assistant
4
+ sdk: gradio
5
+ colorFrom: blue
6
+ colorTo: green
7
+ pinned: true
8
+ thumbnail: >-
9
+ https://cdn-uploads.huggingface.co/production/uploads/68281bec37b367f53ec121da/jbr9Hs4azEMPGbTZ2dYIz.png
10
+ short_description: Hybrid semantic search and auto-labeling for GitHub issues
11
+ ---
12
+
13
+ # 🤗 GitHub Issue Hybrid Search & Auto-Label Assistant
14
+
15
+ **Precision-first hybrid ranking for real GitHub issues using Semantic Search + Multilabel Classification.**
16
+
17
+ This project demonstrates a **production-oriented hybrid retrieval system** that helps with **GitHub issue triage** by combining:
18
+ - Dense semantic search (MPNet embeddings + FAISS)
19
+ - Multilabel text classification (DistilBERT)
20
+ - A hybrid ranking strategy that fuses semantic similarity and label consistency
21
+
22
+ A live, interactive demo is available via **Hugging Face Spaces**.
23
+
24
+ ---
25
+
26
+ ## 🚀 Live Demo
27
+
28
+ 🔗 **Hugging Face Space:**
29
+ *GitHub Issue Hybrid Search & Auto-Label Assistant*
30
+
31
+ Users can describe a GitHub issue in natural language and instantly:
32
+ - See predicted issue labels (e.g. `Bug`, `Needs Triage`)
33
+ - Retrieve the most relevant existing GitHub issues
34
+ - Inspect semantic similarity, label overlap, and final hybrid scores
35
+ - Open the original GitHub issues directly
36
+
37
+ ---
38
+
39
+ ## 🔍 Problem Motivation
40
+
41
+ Large open-source repositories receive **thousands of issues**, making it hard to:
42
+ - Find similar historical issues
43
+ - Detect duplicates
44
+ - Assign correct labels early
45
+ - Route issues to the right maintainers
46
+
47
+ Keyword search alone is often insufficient.
48
+ This project addresses that gap with **semantic + label-aware retrieval**.
49
+
50
+ ---
51
+
52
+ ## 🧠 System Overview
53
+
54
+ The pipeline consists of four main stages:
55
+
56
+ ### 1. Semantic Encoding
57
+ User queries are encoded using dense sentence embeddings:
58
+ - **Model:** `sentence-transformers/all-mpnet-base-v2`
59
+
60
+ Issue texts in the dataset are pre-embedded and stored for fast retrieval.
61
+
62
+ ---
63
+
64
+ ### 2. Semantic Retrieval
65
+ - **Index:** FAISS (runtime-built)
66
+ - Retrieves the nearest issues based on vector similarity
67
+ - Optimized for dense semantic matching rather than keywords
68
+
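Conceptually, nearest-neighbour retrieval over normalized embeddings reduces to ranking stored vectors by their inner product with the query. The sketch below illustrates that idea in pure Python as a brute-force stand-in; it is not FAISS itself, and the README does not specify which FAISS index type or parameters the project uses. The helper names (`normalize`, `top_k`) and the toy 3-dimensional vectors are assumptions for illustration (real `all-mpnet-base-v2` embeddings have 768 dimensions).

```python
import math

def normalize(vec):
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def top_k(query, corpus, k=2):
    """Return (index, cosine_similarity) pairs for the k closest corpus vectors."""
    q = normalize(query)
    scored = []
    for idx, vec in enumerate(corpus):
        v = normalize(vec)
        scored.append((sum(a * b for a, b in zip(q, v)), idx))
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]

# Toy "issue embeddings" (hypothetical values).
corpus = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
hits = top_k([1.0, 0.0, 0.0], corpus, k=2)
```

An exhaustive scan like this is linear in the number of issues; a vector index such as FAISS serves the same role more efficiently on larger collections.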
69
+ ---
70
+
71
+ ### 3. Multilabel Classification
72
+ A fine-tuned DistilBERT model predicts issue labels:
73
+ - Example labels: `Bug`, `Needs Triage`, `module:ensemble`, `module:tree`
74
+ - Multiple labels can be assigned per issue
75
+ - Outputs confidence-based label predictions
76
+
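One common way to turn a multilabel classifier's raw outputs into "confidence-based" predictions is to apply a sigmoid to each logit independently and keep every label above a threshold. The sketch below assumes that approach; the threshold of `0.5`, the logit values, and the `predict_labels` helper are illustrative assumptions, not details confirmed by this README.

```python
import math

# Example labels taken from the README; the model's full label set is not listed.
LABELS = ["Bug", "Needs Triage", "module:ensemble", "module:tree"]

def predict_labels(logits, labels=LABELS, threshold=0.5):
    """Map per-label logits to {label: probability} for labels above threshold."""
    # Each label gets its own sigmoid, so several labels can be active at once.
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return {label: p for label, p in zip(labels, probs) if p >= threshold}

# Hypothetical logits for a single issue description.
predicted = predict_labels([2.0, 0.3, -1.5, -3.0])
```

Unlike softmax, the per-label sigmoid does not force probabilities to sum to one, which is what allows multiple labels per issue.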
77
+ ---
78
+
79
+ ### 4. Hybrid Ranking (Key Contribution)
80
+ Semantic similarity alone is not always enough.
81
+ This system uses a **hybrid scoring function**:
82
+ final_score = α · semantic_similarity + β · label_overlap
83
+
84
+ - **Semantic similarity:** how close the issue texts are in embedding space
85
+ - **Label overlap:** how well predicted labels match existing issue labels
86
+ - **α / β:** tunable weights (precision-first by design)
87
+
88
+ Issues are also **deduplicated by issue number** to avoid repeated results.
89
+
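The scoring and deduplication steps above can be sketched as follows. This is a minimal illustration under stated assumptions: label overlap is computed here as Jaccard similarity, the weights default to `alpha=0.7` and `beta=0.3`, and the candidate records and the `hybrid_rank` helper are hypothetical. The README does not state the project's actual overlap definition or weight values.

```python
def label_overlap(predicted, existing):
    """Jaccard overlap between predicted and existing labels (an assumed definition)."""
    predicted, existing = set(predicted), set(existing)
    union = predicted | existing
    return len(predicted & existing) / len(union) if union else 0.0

def hybrid_rank(candidates, predicted_labels, alpha=0.7, beta=0.3):
    """Score candidates, keep the best hit per issue number, sort by final score."""
    best = {}
    for cand in candidates:
        overlap = label_overlap(predicted_labels, cand["labels"])
        score = alpha * cand["similarity"] + beta * overlap
        number = cand["number"]
        # Deduplicate by issue number, keeping the highest-scoring occurrence.
        if number not in best or score > best[number]["final_score"]:
            best[number] = {**cand, "label_overlap": overlap, "final_score": score}
    return sorted(best.values(), key=lambda c: c["final_score"], reverse=True)

# Hypothetical retrieval hits; issue 101 appears twice.
candidates = [
    {"number": 101, "similarity": 0.80, "labels": ["Bug"]},
    {"number": 101, "similarity": 0.75, "labels": ["Bug"]},
    {"number": 202, "similarity": 0.85, "labels": ["Documentation"]},
]
ranked = hybrid_rank(candidates, predicted_labels=["Bug", "Needs Triage"])
```

In this toy input, issue 202 has the highest semantic similarity but shares no labels with the prediction, so issue 101 ranks first once label overlap is factored in.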
90
  ---
91
 
92
+ ## 🎯 Design Principles
93
+
94
+ ### Precision-First Retrieval
95
+ - The system may return fewer than *k* results intentionally
96
+ - It avoids padding the results with weakly related issues
97
+ - Returning **1–4 highly relevant issues** is considered a success
98
+
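One simple way to get this behaviour is to apply a minimum score cutoff before truncating to *k*, so the result list shrinks when few candidates are strong matches. The sketch below assumes that mechanism; the `min_score` value of `0.6`, the `precision_first` helper, and the sample scores are hypothetical and not taken from the project.

```python
def precision_first(ranked, k=5, min_score=0.6):
    """Keep at most k items, dropping anything below the score cutoff."""
    # Filtering first means fewer than k results are returned when matches are weak.
    return [item for item in ranked if item["final_score"] >= min_score][:k]

# Hypothetical ranked results, already sorted by final hybrid score.
ranked_results = [
    {"number": 101, "final_score": 0.82},
    {"number": 202, "final_score": 0.64},
    {"number": 303, "final_score": 0.41},
]
results = precision_first(ranked_results, k=5, min_score=0.6)
```

Here only two of the three candidates clear the cutoff, so two results are returned even though `k=5`.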
99
+ ### Runtime FAISS Indexing
100
+ - FAISS indices are created at runtime
101
+ - Keeps datasets lightweight and portable on Hugging Face Hub
102
+
103
+ ### Transparency
104
+ - Scores are explicitly shown:
105
+ - Semantic similarity
106
+ - Label overlap
107
+ - Final hybrid score
108
+ - GitHub URLs are fully visible and clickable
109
+
110
+ ---
111
+
112
+ ## 📦 Models & Data
113
+
114
+ ### Embedding Model
115
+ - `sentence-transformers/all-mpnet-base-v2`
116
+
117
+ ### Multilabel Classifier
118
+ - DistilBERT fine-tuned for multilabel issue classification
119
+ - Hosted on Hugging Face Hub
120
+
121
+ ### Dataset
122
+ - Custom GitHub issues dataset (scikit-learn)
123
+ - Pre-computed embeddings stored on Hugging Face Hub
124
+
125
+ ---
126
+
127
+ ## 🧪 Example Use Cases
128
+
129
+ - GitHub issue triage
130
+ - Bug deduplication
131
+ - Support ticket analysis
132
+ - Internal engineering knowledge search
133
+ - Maintainer productivity tools
134
+
135
+ ---
136
+
137
+ ## 🛠 Tech Stack
138
+
139
+ - Python
140
+ - Hugging Face Datasets & Transformers
141
+ - Sentence Transformers
142
+ - FAISS
143
+ - Gradio (UI)
144
+ - Hugging Face Spaces
145
+
146
+ ---
147
+
148
+ ## ✨ What This Project Demonstrates
149
+
150
+ - End-to-end ML system design (not just a model)
151
+ - Semantic search at scale
152
+ - Multilabel NLP in a real-world setting
153
+ - Hybrid ranking strategies
154
+ - Practical UX considerations for ML products
155
+
156
+ ---
157
+
158
+ ## 📌 Notes
159
+
160
+ This project goes **beyond tutorial-level demos** by focusing on:
161
+ - Real datasets
162
+ - Production constraints
163
+ - Explainable ranking behavior
164
+ - Clean, user-facing presentation
165
+
166
+ ---
167
+
168
+ ## 🙌 Acknowledgements
169
+
170
+ Inspired by real GitHub workflows and the Hugging Face ecosystem.