github-issue-hybrid-search / README.md

Talip7

Update README.md

220c409 verified 5 months ago

metadata

license: mit
title: 🤗 GitHub Issue Hybrid Search & Auto-Label Assistant
sdk: gradio
colorFrom: blue
colorTo: green
pinned: true
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/68281bec37b367f53ec121da/jbr9Hs4azEMPGbTZ2dYIz.png
short_description: Hybrid semantic search and auto-labeling for GitHub issues
sdk_version: 6.2.0

🤗 GitHub Issue Hybrid Search & Auto-Label Assistant

Precision-first hybrid ranking for real GitHub issues using Semantic Search + Multilabel Classification.

This project demonstrates a production-oriented hybrid retrieval system that helps with GitHub issue triage by combining:

Dense semantic search (MPNet embeddings + FAISS)
Multilabel text classification (DistilBERT)
A hybrid ranking strategy that fuses semantic similarity and label consistency

A live, interactive demo is available via Hugging Face Spaces.

🚀 Live Demo

🔗 Hugging Face Space:
GitHub Issue Hybrid Search & Auto-Label Assistant

Users can describe a GitHub issue in natural language and instantly:

See predicted issue labels (e.g. Bug, Needs Triage)
Retrieve the most relevant existing GitHub issues
Inspect semantic similarity, label overlap, and final hybrid scores
Open the original GitHub issues directly

🔍 Problem Motivation

Large open-source repositories receive thousands of issues, making it hard to:

Find similar historical issues
Detect duplicates
Assign correct labels early
Route issues to the right maintainers

Keyword search alone often misses issues that describe the same problem in different words.
This project addresses that gap with semantic, label-aware retrieval.

🧠 System Overview

The pipeline consists of four main stages:

1. Semantic Encoding

User queries are encoded using dense sentence embeddings:

Model: sentence-transformers/all-mpnet-base-v2

Issue texts in the dataset are pre-embedded and stored for fast retrieval.

2. Semantic Retrieval

Index: FAISS (runtime-built)
Retrieves the nearest issues based on vector similarity
Optimized for dense semantic matching rather than keywords
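
As a rough illustration of what nearest-neighbor retrieval over embeddings computes, here is a minimal pure-Python sketch. It is a brute-force stand-in, not the FAISS code used by the Space: the function names and the toy 3-dimensional vectors are assumptions made for this example. An exact inner-product FAISS index over unit-normalized vectors would return the same ranking.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def top_k(query, corpus, k=3):
    """Return (index, cosine_similarity) pairs for the k nearest corpus vectors."""
    q = normalize(query)
    scored = [
        (i, sum(a * b for a, b in zip(q, normalize(doc))))
        for i, doc in enumerate(corpus)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 3-dimensional "embeddings" (hypothetical; all-mpnet-base-v2 vectors have 768 dimensions).
corpus = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
results = top_k([1.0, 0.0, 0.0], corpus, k=2)
for index, similarity in results:
    print(index, round(similarity, 4))
```

Brute force is fine for a toy corpus; a vector index matters once the number of stored issues makes a full scan per query too slow.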

3. Multilabel Classification

A fine-tuned DistilBERT model predicts issue labels:

Example labels: Bug, Needs Triage, module:ensemble, module:tree
Multiple labels can be assigned per issue
Outputs confidence-based label predictions
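
The step from raw classifier outputs to multiple labels can be sketched as follows. This is a minimal example under stated assumptions: the logits are made up, the 0.5 threshold is a common default rather than a value taken from this project, and `predict_labels` is a hypothetical helper name. In multilabel classification each label gets its own independent sigmoid, so any number of labels can clear the threshold.

```python
import math

# Label names taken from the README examples; the order is an assumption.
LABELS = ["Bug", "Needs Triage", "module:ensemble", "module:tree"]

def sigmoid(x):
    """Squash a logit into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def predict_labels(logits, labels=LABELS, threshold=0.5):
    """Return (label, confidence) pairs whose probability clears the threshold."""
    probs = [sigmoid(z) for z in logits]
    return [(label, p) for label, p in zip(labels, probs) if p >= threshold]

# Hypothetical logits for one issue description, one value per label.
predicted = predict_labels([2.0, 0.3, -1.5, -3.0])
for label, confidence in predicted:
    print(f"{label}: {confidence:.2f}")
```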

4. Hybrid Ranking (Key Contribution)

Semantic similarity alone is not always enough.
This system uses a hybrid scoring function:

final_score = α · semantic_similarity + β · label_overlap

Semantic similarity: how close the issue texts are in embedding space
Label overlap: how well predicted labels match existing issue labels
α / β: tunable weights (precision-first by design)

Issues are also deduplicated by issue number to avoid repeated results.
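
The scoring and deduplication described above can be sketched in a few lines. Several details here are assumptions for illustration and are not confirmed by the project: label overlap is computed as Jaccard similarity, the weights are α = 0.7 and β = 0.3, and candidates are plain dictionaries with hypothetical field names.

```python
def label_overlap(predicted, existing):
    """Jaccard overlap between two label sets (an assumed definition)."""
    predicted, existing = set(predicted), set(existing)
    union = predicted | existing
    return len(predicted & existing) / len(union) if union else 0.0

def hybrid_rank(candidates, predicted_labels, alpha=0.7, beta=0.3):
    """Score candidates, keep the best entry per issue number, sort descending."""
    best = {}
    for cand in candidates:
        overlap = label_overlap(predicted_labels, cand["labels"])
        score = alpha * cand["similarity"] + beta * overlap
        number = cand["number"]
        # Deduplicate: only the highest-scoring hit for each issue number survives.
        if number not in best or score > best[number]["final_score"]:
            best[number] = {**cand, "label_overlap": overlap, "final_score": score}
    return sorted(best.values(), key=lambda c: c["final_score"], reverse=True)

# Hypothetical retrieval hits; issue 101 appears twice.
candidates = [
    {"number": 101, "similarity": 0.80, "labels": ["Bug"]},
    {"number": 102, "similarity": 0.85, "labels": ["module:tree"]},
    {"number": 101, "similarity": 0.60, "labels": ["Bug"]},
]
ranked = hybrid_rank(candidates, predicted_labels=["Bug", "Needs Triage"])
for item in ranked:
    print(item["number"], round(item["final_score"], 3))
```

In this toy input, issue 102 has the higher semantic similarity, but issue 101 ranks first because its labels agree with the prediction. That reordering is the intended effect of the label term.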

🎯 Design Principles

Precision-First Retrieval

The system may return fewer than k results intentionally
It avoids padding the results with weakly related issues
Returning 1–4 highly relevant issues is considered a success
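
One simple way to get this behavior is a minimum-score cutoff applied before truncating to k. The sketch below assumes that approach; the `min_score` value, the function name, and the sample scores are hypothetical and not taken from the project.

```python
def precision_first(ranked, k=5, min_score=0.6):
    """Return at most k results, dropping any whose score is below min_score.

    Assumes `ranked` is already sorted by final_score in descending order.
    """
    kept = [item for item in ranked if item["final_score"] >= min_score]
    return kept[:k]

# Hypothetical ranked results with made-up scores.
ranked = [
    {"number": 101, "final_score": 0.71},
    {"number": 205, "final_score": 0.64},
    {"number": 102, "final_score": 0.41},
    {"number": 330, "final_score": 0.22},
]
top = precision_first(ranked, k=5, min_score=0.6)
print([item["number"] for item in top])
```

With k = 5 requested, only two issues clear the cutoff, so only two are returned.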

Runtime FAISS Indexing

FAISS indices are created at runtime
Keeps datasets lightweight and portable on Hugging Face Hub

Transparency

Scores are explicitly shown:
- Semantic similarity
- Label overlap
- Final hybrid score
GitHub URLs are fully visible and clickable

📦 Models & Data

Embedding Model

sentence-transformers/all-mpnet-base-v2

Multilabel Classifier

DistilBERT fine-tuned for multilabel issue classification
Hosted on Hugging Face Hub

Dataset

Custom dataset of GitHub issues (scikit-learn repository)
Pre-computed embeddings stored on Hugging Face Hub

🧪 Example Use Cases

GitHub issue triage
Bug deduplication
Support ticket analysis
Internal engineering knowledge search
Maintainer productivity tools

🛠 Tech Stack

Python
Hugging Face Datasets & Transformers
Sentence Transformers
FAISS
Gradio (UI)
Hugging Face Spaces

✨ What This Project Demonstrates

End-to-end ML system design (not just a model)
Semantic search at scale
Multilabel NLP in a real-world setting
Hybrid ranking strategies
Practical UX considerations for ML products

📌 Notes

This project goes beyond tutorial-level demos by focusing on:

Real datasets
Production constraints
Explainable ranking behavior
Clean, user-facing presentation

🙌 Acknowledgements

Inspired by real GitHub workflows and the Hugging Face ecosystem.