---
license: mit
title: 🤗 GitHub Issue Hybrid Search & Auto-Label Assistant
sdk: gradio
colorFrom: blue
colorTo: green
pinned: true
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/68281bec37b367f53ec121da/jbr9Hs4azEMPGbTZ2dYIz.png
short_description: Hybrid semantic search and auto-labeling for GitHub issues
sdk_version: 6.2.0
---

# 🤗 GitHub Issue Hybrid Search & Auto-Label Assistant

**Precision-first hybrid ranking for real GitHub issues using Semantic Search + Multilabel Classification.**

This project demonstrates a **production-oriented hybrid retrieval system** that helps with **GitHub issue triage** by combining:
- Dense semantic search (MPNet embeddings + FAISS)
- Multilabel text classification (DistilBERT)
- A hybrid ranking strategy that fuses semantic similarity and label consistency

A live, interactive demo is available via **Hugging Face Spaces**.

---

## 🚀 Live Demo

🔗 **Hugging Face Space:**  
*GitHub Issue Hybrid Search & Auto-Label Assistant*

Users can describe a GitHub issue in natural language and instantly:
- See predicted issue labels (e.g. `Bug`, `Needs Triage`)
- Retrieve the most relevant existing GitHub issues
- Inspect semantic similarity, label overlap, and final hybrid scores
- Open the original GitHub issues directly

---

## 🔍 Problem Motivation

Large open-source repositories receive **thousands of issues**, making it hard to:
- Find similar historical issues
- Detect duplicates
- Assign correct labels early
- Route issues to the right maintainers

Keyword search alone is often insufficient.  
This project addresses that gap with **semantic + label-aware retrieval**.

---

## 🧠 System Overview

The pipeline consists of four main stages:

### 1. Semantic Encoding
User queries are encoded using dense sentence embeddings:
- **Model:** `sentence-transformers/all-mpnet-base-v2`

Issue texts in the dataset are pre-embedded and stored for fast retrieval.

---

### 2. Semantic Retrieval
- **Index:** FAISS (runtime-built)
- Retrieves the nearest issues based on vector similarity
- Optimized for dense semantic matching rather than keywords
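
To make the retrieval step concrete, here is a minimal pure-Python sketch of what an exact inner-product search computes over unit-length vectors. It is a stand-in for illustration only, not the FAISS code used by the Space. The function names `normalize` and `top_k` and the toy 3-dimensional vectors are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def top_k(query, corpus, k=3):
    """Return (index, cosine similarity) pairs for the k closest corpus vectors."""
    q = normalize(query)
    scored = []
    for idx, vec in enumerate(corpus):
        v = normalize(vec)
        scored.append((idx, sum(a * b for a, b in zip(q, v))))
    # Highest similarity first; ties broken by corpus order.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


# Toy embeddings (real all-mpnet-base-v2 embeddings have 768 dimensions).
corpus = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
results = top_k([1.0, 0.05, 0.0], corpus, k=2)
for idx, score in results:
    print(f"issue vector {idx}: similarity={score:.3f}")
```

A vector index such as FAISS performs the same kind of nearest-neighbor lookup far more efficiently; the brute-force loop above only shows the scoring idea.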

---

### 3. Multilabel Classification
A fine-tuned DistilBERT model predicts issue labels:
- Example labels: `Bug`, `Needs Triage`, `module:ensemble`, `module:tree`
- Multiple labels can be assigned per issue
- Outputs a confidence score for each predicted label
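
Multilabel classifiers commonly score each label independently with a sigmoid and keep every label above a cutoff. The sketch below illustrates that pattern under stated assumptions: the logits are made up, the `0.5` threshold is an assumed default, and `predict_labels` is a hypothetical helper, not the Space's actual inference code.

```python
import math


def sigmoid(x):
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(logits, label_names, threshold=0.5):
    """Keep every label whose independent probability meets the threshold."""
    probs = [sigmoid(z) for z in logits]
    picked = [(name, p) for name, p in zip(label_names, probs) if p >= threshold]
    # Most confident labels first.
    return sorted(picked, key=lambda pair: -pair[1])


label_names = ["Bug", "Needs Triage", "module:ensemble", "module:tree"]
logits = [2.0, 0.5, -1.5, -3.0]  # hypothetical classifier outputs
predicted = predict_labels(logits, label_names)
for name, prob in predicted:
    print(f"{name}: {prob:.2f}")
```

Because each label is thresholded on its own, an issue can receive several labels or none at all.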

---

### 4. Hybrid Ranking (Key Contribution)
Semantic similarity alone is not always enough.  
This system uses a **hybrid scoring function**:

`final_score = α · semantic_similarity + β · label_overlap`

- **Semantic similarity:** how close the issue texts are in embedding space
- **Label overlap:** how well predicted labels match existing issue labels
- **α / β:** tunable weights (precision-first by design)

Issues are also **deduplicated by issue number** to avoid repeated results.
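
The scoring and deduplication steps can be sketched as follows. This is a minimal illustration, not the Space's implementation: label overlap is assumed here to be the Jaccard index of the two label sets, the weights `alpha=0.7` and `beta=0.3` are placeholder values, and the candidate records are invented.

```python
def label_overlap(predicted, existing):
    """Jaccard overlap between two label sets (an assumed definition)."""
    predicted, existing = set(predicted), set(existing)
    union = predicted | existing
    return len(predicted & existing) / len(union) if union else 0.0


def hybrid_rank(candidates, predicted_labels, alpha=0.7, beta=0.3):
    """Score candidates, keep the best hit per issue number, sort by final score."""
    best = {}
    for cand in candidates:
        overlap = label_overlap(predicted_labels, cand["labels"])
        final = alpha * cand["semantic"] + beta * overlap
        scored = {**cand, "overlap": overlap, "final": final}
        number = cand["number"]
        # Deduplicate: keep only the highest-scoring entry per issue number.
        if number not in best or final > best[number]["final"]:
            best[number] = scored
    return sorted(best.values(), key=lambda c: -c["final"])


candidates = [  # hypothetical retrieval output
    {"number": 101, "semantic": 0.80, "labels": ["Bug"]},
    {"number": 101, "semantic": 0.60, "labels": ["Bug"]},
    {"number": 202, "semantic": 0.85, "labels": ["Documentation"]},
]
ranked = hybrid_rank(candidates, predicted_labels=["Bug", "Needs Triage"])
for row in ranked:
    print(row["number"], round(row["final"], 3))
```

In this toy run, issue 202 has the higher semantic similarity but shares no labels with the prediction, so issue 101 ranks first once label overlap is weighed in.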

---

## 🎯 Design Principles

### Precision-First Retrieval
- The system may intentionally return fewer than *k* results
- It avoids padding the result list with weakly related issues
- Returning **1–4 highly relevant issues** is considered a success
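
One simple way to get this behavior is a score cutoff applied before truncating to *k*. The sketch below assumes that approach; the `min_score` parameter, its value, and the sample scores are hypothetical, and the Space may use a different rule.

```python
def precision_first(ranked, k=5, min_score=0.6):
    """Return at most k results, dropping anything below min_score."""
    strong = [r for r in ranked if r["final"] >= min_score]
    return strong[:k]


ranked = [  # hypothetical hybrid scores, already sorted best-first
    {"number": 101, "final": 0.82},
    {"number": 202, "final": 0.64},
    {"number": 303, "final": 0.41},
    {"number": 404, "final": 0.12},
]
kept = precision_first(ranked, k=5)
print([r["number"] for r in kept])
```

Here five results were requested but only two clear the cutoff, so only two are returned.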

### Runtime FAISS Indexing
- FAISS indices are built at runtime from the pre-computed embeddings instead of being stored with the data
- This keeps the datasets on the Hugging Face Hub lightweight and portable

### Transparency
- Scores are explicitly shown:
  - Semantic similarity
  - Label overlap
  - Final hybrid score
- GitHub URLs are fully visible and clickable
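
As a rough illustration of this kind of output, the sketch below renders one result as a Markdown line with its score breakdown and a clickable link. The `format_result` helper, the field names, and the placeholder URL are all hypothetical and not taken from the Space's code.

```python
def format_result(result):
    """Render one ranked issue as a Markdown line with its score breakdown."""
    return (
        f"[#{result['number']}]({result['url']}) "
        f"semantic={result['semantic']:.2f} | "
        f"label_overlap={result['overlap']:.2f} | "
        f"final={result['final']:.2f}"
    )


result = {  # hypothetical ranked issue
    "number": 101,
    "url": "https://github.com/example-org/example-repo/issues/101",
    "semantic": 0.80,
    "overlap": 0.50,
    "final": 0.71,
}
line = format_result(result)
print(line)
```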

---

## 📦 Models & Data

### Embedding Model
- `sentence-transformers/all-mpnet-base-v2`

### Multilabel Classifier
- DistilBERT fine-tuned for multilabel issue classification
- Hosted on Hugging Face Hub

### Dataset
- Custom dataset of GitHub issues from the scikit-learn repository
- Pre-computed embeddings stored on Hugging Face Hub

---

## 🧪 Example Use Cases

- GitHub issue triage
- Bug deduplication
- Support ticket analysis
- Internal engineering knowledge search
- Maintainer productivity tools

---

## 🛠 Tech Stack

- Python
- Hugging Face Datasets & Transformers
- Sentence Transformers
- FAISS
- Gradio (UI)
- Hugging Face Spaces

---

## ✨ What This Project Demonstrates

- End-to-end ML system design (not just a model)
- Semantic search at scale
- Multilabel NLP in a real-world setting
- Hybrid ranking strategies
- Practical UX considerations for ML products

---

## 📌 Notes

This project goes **beyond tutorial-level demos** by focusing on:
- Real datasets
- Production constraints
- Explainable ranking behavior
- Clean, user-facing presentation

---

## 🙌 Acknowledgements

Inspired by real GitHub workflows and the Hugging Face ecosystem.