---
language: en
tags:
- code
- semantic-search
- jepa
- code-search
license: mit
datasets:
- claudios/code_search_net
metrics:
- mrr
---

# Repo-JEPA: Semantic Code Navigator (SOTA 0.90 MRR)

A **Joint Embedding Predictive Architecture** (JEPA) for semantic code search, trained on 411,000 real Python functions using an NVIDIA H100.

## 🏆 Performance

Evaluated on 1,000 unseen real-world Python functions from CodeSearchNet.

| Metric | Result | Target |
|--------|--------|--------|
| **MRR** | **0.9052** | 0.60 |
| **Hits@1** | **86.2%** | - |
| **Hits@5** | **95.9%** | - |
| **Hits@10** | **97.3%** | - |
| **Median Rank** | **1.0** | - |

## 🧩 Usage (AutoModel)

```python
from transformers import AutoModel, AutoTokenizer

# 1. Load the model (custom remote code) and the CodeBERT tokenizer
model = AutoModel.from_pretrained("uddeshya-k/RepoJepa", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")

# 2. Encode a code snippet
code = "def handle_login(user): return auth.verify(user)"
code_embed = model.encode_code(**tokenizer(code, return_tensors="pt"))

# 3. Encode a natural-language query
query = "how to authenticate users?"
query_embed = model.encode_query(**tokenizer(query, return_tensors="pt"))

# 4. Search: score the pair by dot product (higher = better match)
similarity = (code_embed @ query_embed.T).item()
print(f"Similarity: {similarity:.4f}")
```

For ranking many functions against one query, see the corpus sketch at the end of this card.

## 🏗️ Technical Details

- **Backbone**: CodeBERT (RoBERTa-style)
- **Loss**: VICReg (Variance-Invariance-Covariance Regularization), sketched below
- **Hardware**: NVIDIA H100 PCIe (80GB VRAM)
- **Optimizer**: AdamW + OneCycleLR
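
## 📉 VICReg Loss (Sketch)

This card does not publish the training loss code, so the following is a minimal PyTorch sketch of VICReg as defined by Bardes, Ponce & LeCun (ICLR 2022), applied to paired embeddings from the two branches. The loss weights (25 / 25 / 1) and the variance target of 1.0 are that paper's defaults, assumed here rather than taken from this model's training config.

```python
import torch
import torch.nn.functional as F

def off_diagonal(m):
    # All off-diagonal elements of a square (d, d) matrix.
    d = m.shape[0]
    return m.flatten()[:-1].view(d - 1, d + 1)[:, 1:].flatten()

def vicreg_loss(z_a, z_b, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    """z_a, z_b: (batch, dim) embeddings of the two branches."""
    n, d = z_a.shape

    # Invariance: pull matched pairs together.
    inv_loss = F.mse_loss(z_a, z_b)

    # Variance: hinge that keeps each dimension's std above 1.0,
    # preventing collapse to a constant vector.
    std_a = torch.sqrt(z_a.var(dim=0) + eps)
    std_b = torch.sqrt(z_b.var(dim=0) + eps)
    var_loss = F.relu(1.0 - std_a).mean() + F.relu(1.0 - std_b).mean()

    # Covariance: decorrelate embedding dimensions so information
    # spreads across the whole vector.
    z_a = z_a - z_a.mean(dim=0)
    z_b = z_b - z_b.mean(dim=0)
    cov_a = (z_a.T @ z_a) / (n - 1)
    cov_b = (z_b.T @ z_b) / (n - 1)
    cov_loss = off_diagonal(cov_a).pow(2).sum() / d + off_diagonal(cov_b).pow(2).sum() / d

    return sim_w * inv_loss + var_w * var_loss + cov_w * cov_loss
```

The variance and covariance terms are what let JEPA-style training avoid representation collapse without contrastive negatives.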
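
## 🔎 Ranking a Corpus (Sketch)

The usage example scores a single code/query pair; in practice you encode a corpus once and rank it per query. This is a minimal sketch assuming `encode_code` and `encode_query` each return a `(1, hidden_dim)` tensor, as the usage example implies; the three corpus functions below are hypothetical placeholders.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("uddeshya-k/RepoJepa", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model.eval()

# Hypothetical mini-corpus; in practice, these are functions mined from your repo.
corpus = [
    "def handle_login(user): return auth.verify(user)",
    "def parse_csv(path): return list(csv.reader(open(path)))",
    "def send_email(to, body): smtp.sendmail(FROM_ADDR, to, body)",
]

with torch.no_grad():
    # Encode every function once; these embeddings can be cached offline.
    code_embeds = torch.cat([
        model.encode_code(**tokenizer(fn, return_tensors="pt", truncation=True))
        for fn in corpus
    ])  # shape: (num_functions, hidden_dim)

    query = "how to authenticate users?"
    query_embed = model.encode_query(**tokenizer(query, return_tensors="pt"))

# Rank the corpus by dot-product similarity, best match first.
scores = (code_embeds @ query_embed.T).squeeze(-1)
for rank, idx in enumerate(scores.argsort(descending=True).tolist(), start=1):
    print(f"{rank}. score={scores[idx].item():.4f}  {corpus[idx]}")
```

Encoding the corpus ahead of time is what makes per-query search cheap: at query time only one forward pass and one matrix-vector product are needed.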