---
license: mit
title: 🤗 GitHub Issue Hybrid Search & Auto-Label Assistant
sdk: gradio
colorFrom: blue
colorTo: green
pinned: true
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/68281bec37b367f53ec121da/jbr9Hs4azEMPGbTZ2dYIz.png
short_description: Hybrid semantic search and auto-labeling for GitHub issues
sdk_version: 6.2.0
---

# 🤗 GitHub Issue Hybrid Search & Auto-Label Assistant

**Precision-first hybrid ranking for real GitHub issues using Semantic Search + Multilabel Classification.**

This project demonstrates a **production-oriented hybrid retrieval system** that helps with **GitHub issue triage** by combining:

- Dense semantic search (MPNet embeddings + FAISS)
- Multilabel text classification (DistilBERT)
- A hybrid ranking strategy that fuses semantic similarity and label consistency

A live, interactive demo is available via **Hugging Face Spaces**.

---

## 🚀 Live Demo

🔗 **Hugging Face Space:** *GitHub Issue Hybrid Search & Auto-Label Assistant*

Users can describe a GitHub issue in natural language and instantly:

- See predicted issue labels (e.g. `Bug`, `Needs Triage`)
- Retrieve the most relevant existing GitHub issues
- Inspect semantic similarity, label overlap, and final hybrid scores
- Open the original GitHub issues directly

---

## 🔍 Problem Motivation

Large open-source repositories receive **thousands of issues**, making it hard to:

- Find similar historical issues
- Detect duplicates
- Assign correct labels early
- Route issues to the right maintainers

Keyword search alone is often insufficient. This project addresses that gap with **semantic + label-aware retrieval**.

---

## 🧠 System Overview

The pipeline consists of four main stages:

### 1. Semantic Encoding

User queries are encoded using dense sentence embeddings:

- **Model:** `sentence-transformers/all-mpnet-base-v2`

Issue texts in the dataset are pre-embedded and stored for fast retrieval.

---

### 2. Semantic Retrieval

- **Index:** FAISS (runtime-built)
- Retrieves the nearest issues based on vector similarity
- Optimized for dense semantic matching rather than keywords

---

### 3. Multilabel Classification

A fine-tuned DistilBERT model predicts issue labels:

- Example labels: `Bug`, `Needs Triage`, `module:ensemble`, `module:tree`
- Multiple labels can be assigned per issue
- Outputs confidence-based label predictions

---

### 4. Hybrid Ranking (Key Contribution)

Semantic similarity alone is not always enough. This system uses a **hybrid scoring function**:

`final_score = α · semantic_similarity + β · label_overlap`

- **Semantic similarity:** how close the issue texts are in embedding space
- **Label overlap:** how well predicted labels match existing issue labels
- **α / β:** tunable weights (precision-first by design)

Issues are also **deduplicated by issue number** to avoid repeated results.

---

## 🎯 Design Principles

### Precision-First Retrieval

- The system may return fewer than *k* results intentionally
- It avoids hallucinating weakly related issues
- Returning **1–4 highly relevant issues** is considered a success

### Runtime FAISS Indexing

- FAISS indices are created at runtime
- Keeps datasets lightweight and portable on Hugging Face Hub

### Transparency

- Scores are explicitly shown:
  - Semantic similarity
  - Label overlap
  - Final hybrid score
- GitHub URLs are fully visible and clickable

---

## 📦 Models & Data

### Embedding Model

- `sentence-transformers/all-mpnet-base-v2`

### Multilabel Classifier

- DistilBERT fine-tuned for multilabel issue classification
- Hosted on Hugging Face Hub

### Dataset

- Custom GitHub issues dataset (scikit-learn)
- Pre-computed embeddings stored on Hugging Face Hub

---

## 🧪 Example Use Cases

- GitHub issue triage
- Bug deduplication
- Support ticket analysis
- Internal engineering knowledge search
- Maintainer productivity tools

---

## 🛠 Tech Stack

- Python
- Hugging Face Datasets & Transformers
- Sentence Transformers
- FAISS
- Gradio (UI)
- Hugging Face Spaces

---

## ✨ What This Project Demonstrates

- End-to-end ML system design (not just a model)
- Semantic search at scale
- Multilabel NLP in a real-world setting
- Hybrid ranking strategies
- Practical UX considerations for ML products

---

## 📌 Notes

This project goes **beyond tutorial-level demos** by focusing on:

- Real datasets
- Production constraints
- Explainable ranking behavior
- Clean, user-facing presentation

---

## 🙌 Acknowledgements

Inspired by real GitHub workflows and the Hugging Face ecosystem.
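
---

## Appendix: Hybrid Scoring Sketch

The hybrid ranking stage described under "Hybrid Ranking" can be illustrated with a small, dependency-free sketch. This is not the project's actual code. The `Candidate` class, the `hybrid_rank` and `label_overlap` helpers, the use of Jaccard overlap as the label-overlap measure, the weights `alpha=0.7` and `beta=0.3`, and the `min_score` cutoff are all assumptions chosen for illustration. The similarity values are made up and assumed to lie in [0, 1].

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    number: int          # GitHub issue number
    similarity: float    # semantic similarity to the query, assumed in [0, 1]
    labels: frozenset    # labels already attached to the issue


def label_overlap(predicted, existing):
    """Jaccard overlap between predicted and existing labels (an assumed metric)."""
    predicted, existing = set(predicted), set(existing)
    union = predicted | existing
    return len(predicted & existing) / len(union) if union else 0.0


def hybrid_rank(candidates, predicted_labels, alpha=0.7, beta=0.3, k=5, min_score=0.5):
    """Score candidates, deduplicate by issue number, keep at most k above min_score."""
    best = {}
    for c in candidates:
        score = alpha * c.similarity + beta * label_overlap(predicted_labels, c.labels)
        # Keep only the highest-scoring entry per issue number.
        if c.number not in best or score > best[c.number][0]:
            best[c.number] = (score, c)
    ranked = sorted(best.values(), key=lambda pair: pair[0], reverse=True)
    # Precision-first: drop weak matches even if that leaves fewer than k results.
    return [(round(score, 3), c.number) for score, c in ranked if score >= min_score][:k]


# Illustrative candidates, as if returned by the semantic retrieval stage.
candidates = [
    Candidate(101, 0.90, frozenset({"Bug", "module:tree"})),
    Candidate(101, 0.80, frozenset({"Bug", "module:tree"})),  # duplicate issue number
    Candidate(202, 0.60, frozenset({"Needs Triage"})),
    Candidate(303, 0.30, frozenset({"module:ensemble"})),
]
results = hybrid_rank(candidates, predicted_labels={"Bug", "module:tree"})
print(results)  # [(0.93, 101)]
```

With these inputs only one issue clears the cutoff, which mirrors the precision-first principle: the function returns fewer than `k` results rather than padding the list with weak matches. Raising `beta` relative to `alpha` would favor label agreement over raw embedding similarity.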