metadata
license: mit
title: 🤗 GitHub Issue Hybrid Search & Auto-Label Assistant
sdk: gradio
colorFrom: blue
colorTo: green
pinned: true
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/68281bec37b367f53ec121da/jbr9Hs4azEMPGbTZ2dYIz.png
short_description: Hybrid semantic search and auto-labeling for GitHub issues
sdk_version: 6.2.0

🤗 GitHub Issue Hybrid Search & Auto-Label Assistant

Precision-first hybrid ranking for real GitHub issues using Semantic Search + Multilabel Classification.

This project demonstrates a production-oriented hybrid retrieval system that helps with GitHub issue triage by combining:

  • Dense semantic search (MPNet embeddings + FAISS)
  • Multilabel text classification (DistilBERT)
  • A hybrid ranking strategy that fuses semantic similarity and label consistency

A live, interactive demo is available via Hugging Face Spaces.


🚀 Live Demo

🔗 Hugging Face Space:
GitHub Issue Hybrid Search & Auto-Label Assistant

Users can describe a GitHub issue in natural language and instantly:

  • See predicted issue labels (e.g. Bug, Needs Triage)
  • Retrieve the most relevant existing GitHub issues
  • Inspect semantic similarity, label overlap, and final hybrid scores
  • Open the original GitHub issues directly

🔍 Problem Motivation

Large open-source repositories receive thousands of issues, making it hard to:

  • Find similar historical issues
  • Detect duplicates
  • Assign correct labels early
  • Route issues to the right maintainers

Keyword search alone often misses issues that describe the same problem in different words.
This project addresses that gap with semantic, label-aware retrieval.


🧠 System Overview

The pipeline consists of four main stages:

1. Semantic Encoding

User queries are encoded using dense sentence embeddings:

  • Model: sentence-transformers/all-mpnet-base-v2

Issue texts in the dataset are pre-embedded and stored for fast retrieval.


2. Semantic Retrieval

  • Index: FAISS (runtime-built)
  • Retrieves the nearest issues based on vector similarity
  • Optimized for dense semantic matching rather than keywords
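
To make the retrieval step concrete, here is a minimal sketch of nearest-neighbour search by cosine similarity. It is not the Space's code: it uses plain Python lists in place of a FAISS index, and the toy 3-dimensional vectors stand in for the 768-dimensional MPNet embeddings. The function names `cosine` and `nearest_issues` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_u * norm_v)

def nearest_issues(query_vec, issue_vecs, k=3):
    """Return (index, similarity) pairs for the k most similar issue vectors."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(issue_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy vectors (assumption): real embeddings would come from the encoder.
issue_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
hits = nearest_issues([1.0, -0.1, 0.0], issue_vecs, k=2)
```

A vector index performs the same kind of ranking without scanning every vector in Python, which is why the system relies on FAISS at runtime.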

3. Multilabel Classification

A fine-tuned DistilBERT model predicts issue labels:

  • Example labels: Bug, Needs Triage, module:ensemble, module:tree
  • Multiple labels can be assigned per issue
  • Outputs confidence-based label predictions
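
As an illustration of how multilabel outputs are usually turned into labels, the sketch below applies an independent sigmoid to each logit and keeps every label above a threshold. The logits, the 0.5 threshold, and the function names are assumptions for the example, not values taken from the fine-tuned model.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def predict_labels(logits, label_names, threshold=0.5):
    """Return {label: probability} for every label whose probability meets the threshold.

    Each label is scored independently, so zero, one, or several labels may be kept.
    """
    probs = [sigmoid(z) for z in logits]
    return {name: p for name, p in zip(label_names, probs) if p >= threshold}

LABELS = ["Bug", "Needs Triage", "module:ensemble", "module:tree"]
# Hypothetical logits for one issue description.
predicted = predict_labels([2.0, 0.3, -1.5, -3.0], LABELS)
```

Raising the threshold trades recall for precision, which fits the precision-first goal of the project.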

4. Hybrid Ranking (Key Contribution)

Semantic similarity alone is not always enough.
This system uses a hybrid scoring function:

final_score = α · semantic_similarity + β · label_overlap

  • Semantic similarity: how close the issue texts are in embedding space
  • Label overlap: how well predicted labels match existing issue labels
  • α / β: tunable weights (precision-first by design)

Issues are also deduplicated by issue number to avoid repeated results.
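
The scoring and deduplication described above can be sketched as follows. This is a hedged illustration, not the project's implementation: label overlap is assumed to be Jaccard overlap between label sets, the weights 0.7 and 0.3 are placeholders, and the dictionary fields, issue numbers, and the "Documentation" label are hypothetical.

```python
def label_overlap(predicted, existing):
    """Jaccard overlap between two label collections (an assumed definition)."""
    predicted, existing = set(predicted), set(existing)
    if not predicted and not existing:
        return 0.0
    return len(predicted & existing) / len(predicted | existing)

def hybrid_rank(candidates, predicted_labels, alpha=0.7, beta=0.3):
    """Score each candidate, keep the best row per issue number, sort best first."""
    best = {}
    for c in candidates:
        overlap = label_overlap(predicted_labels, c["labels"])
        score = alpha * c["semantic_similarity"] + beta * overlap
        row = {**c, "label_overlap": overlap, "final_score": score}
        number = c["number"]
        # Deduplicate: only the highest-scoring hit for an issue survives.
        if number not in best or score > best[number]["final_score"]:
            best[number] = row
    return sorted(best.values(), key=lambda r: r["final_score"], reverse=True)

candidates = [
    {"number": 101, "semantic_similarity": 0.80, "labels": ["Bug"]},
    {"number": 202, "semantic_similarity": 0.85, "labels": ["Documentation"]},
    {"number": 101, "semantic_similarity": 0.75, "labels": ["Bug"]},  # duplicate hit
]
ranked = hybrid_rank(candidates, predicted_labels=["Bug", "Needs Triage"])
```

In this toy input, issue 202 has the higher semantic similarity, but issue 101 ranks first because its labels agree with the prediction.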


🎯 Design Principles

Precision-First Retrieval

  • The system may return fewer than k results intentionally
  • It avoids padding the results with weakly related issues
  • Returning 1–4 highly relevant issues is considered a success
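
One simple way to get this behaviour is a minimum-score cutoff applied before truncating to k. The sketch below assumes such a cutoff; the 0.6 threshold, the function name `precise_top_k`, and the sample scores are illustrative only.

```python
def precise_top_k(scored_issues, k=5, min_score=0.6):
    """Return at most k (issue_number, score) pairs whose score meets min_score, best first."""
    kept = [item for item in scored_issues if item[1] >= min_score]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:k]

# Hypothetical (issue_number, final_score) pairs.
scored = [(101, 0.71), (202, 0.595), (303, 0.82), (404, 0.31)]
results = precise_top_k(scored, k=5, min_score=0.6)
```

Here only two of the four candidates clear the cutoff, so the caller receives two results even though k is 5.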

Runtime FAISS Indexing

  • FAISS indices are created at runtime
  • Keeps datasets lightweight and portable on Hugging Face Hub

Transparency

  • Scores are explicitly shown:
    • Semantic similarity
    • Label overlap
    • Final hybrid score
  • GitHub URLs are fully visible and clickable

📦 Models & Data

Embedding Model

  • sentence-transformers/all-mpnet-base-v2

Multilabel Classifier

  • DistilBERT fine-tuned for multilabel issue classification
  • Hosted on Hugging Face Hub

Dataset

  • Custom GitHub issues dataset (issues from the scikit-learn repository)
  • Pre-computed embeddings stored on Hugging Face Hub

🧪 Example Use Cases

  • GitHub issue triage
  • Bug deduplication
  • Support ticket analysis
  • Internal engineering knowledge search
  • Maintainer productivity tools

🛠 Tech Stack

  • Python
  • Hugging Face Datasets & Transformers
  • Sentence Transformers
  • FAISS
  • Gradio (UI)
  • Hugging Face Spaces

✨ What This Project Demonstrates

  • End-to-end ML system design (not just a model)
  • Semantic search at scale
  • Multilabel NLP in a real-world setting
  • Hybrid ranking strategies
  • Practical UX considerations for ML products

📌 Notes

This project goes beyond tutorial-level demos by focusing on:

  • Real datasets
  • Production constraints
  • Explainable ranking behavior
  • Clean, user-facing presentation

🙌 Acknowledgements

Inspired by real GitHub workflows and the Hugging Face ecosystem.