---

language: en
tags:
- code
- semantic-search
- jepa
- code-search
license: mit
datasets:
- claudios/code_search_net
metrics:
- mrr
---


# Repo-JEPA: Semantic Code Navigator (SOTA 0.90 MRR)

A **Joint Embedding Predictive Architecture** (JEPA) for semantic code search, trained on 411,000 real Python functions using an NVIDIA H100.

## 🏆 Performance

Tested on 1,000 unseen real-world Python functions from CodeSearchNet.

| Metric | Result | Target |
|--------|--------|--------|
| **MRR** | **0.9052** | 0.60 |
| **Hits@1** | **86.2%** | - |
| **Hits@5** | **95.9%** | - |
| **Hits@10** | **97.3%** | - |
| **Median Rank** | **1.0** | - |

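MRR is the mean of 1/rank of the correct function across all queries, so an MRR of 0.9052 means the right function typically sits at or very near rank 1. A small self-contained sketch of how MRR and Hits@k are computed (the ranks below are hypothetical, not from this evaluation):

```python
def mrr(ranks):
    """Mean Reciprocal Rank: average of 1/rank of the correct result."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_k(ranks, k):
    """Fraction of queries whose correct result appears in the top k."""
    return sum(r <= k for r in ranks) / len(ranks)

# Hypothetical ranks of the correct function for four queries
ranks = [1, 1, 3, 10]
print(f"MRR:    {mrr(ranks):.4f}")          # (1 + 1 + 1/3 + 1/10) / 4
print(f"Hits@5: {hits_at_k(ranks, 5):.0%}")
```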
## 🧩 Usage (AutoModel)

```python
from transformers import AutoModel, AutoTokenizer

# 1. Load model
model = AutoModel.from_pretrained("uddeshya-k/RepoJepa", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")

# 2. Encode code
code = "def handle_login(user): return auth.verify(user)"
code_embed = model.encode_code(**tokenizer(code, return_tensors="pt"))

# 3. Encode query
query = "how to authenticate users?"
query_embed = model.encode_query(**tokenizer(query, return_tensors="pt"))

# 4. Score
similarity = (code_embed @ query_embed.T).item()
print(f"Similarity: {similarity:.4f}")
```
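To search a corpus rather than score a single pair, encode every function once and rank them against the query embedding. A minimal sketch with small placeholder vectors standing in for `encode_code`/`encode_query` outputs (the cosine normalization is an assumption for illustration, not something this card specifies):

```python
import math

def normalize(v):
    """L2-normalize a vector."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    """Cosine similarity: dot product of L2-normalized vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))

# Placeholder embeddings standing in for model.encode_code(...) outputs
corpus = {
    "handle_login": [0.9, 0.1, 0.2],
    "parse_csv":    [0.1, 0.8, 0.3],
}
# Placeholder for model.encode_query(...) on "how to authenticate users?"
query_embed = [0.8, 0.2, 0.1]

# Rank corpus functions by similarity to the query, best first
ranked = sorted(corpus, key=lambda k: cosine(corpus[k], query_embed), reverse=True)
print(ranked[0])  # name of the closest function
```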

## 🏗️ Technical Details

- **Backbone**: CodeBERT (RoBERTa-style)
- **Loss**: VICReg (Variance-Invariance-Covariance Regularization)
- **Hardware**: NVIDIA H100 PCIe (80GB VRAM)
- **Optimizer**: AdamW + OneCycleLR
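VICReg pulls paired embeddings together (invariance) while keeping each embedding dimension spread out (variance) and decorrelated (covariance), which prevents representation collapse without negative pairs. A minimal pure-Python sketch of the loss; the 25/25/1 weights are the VICReg paper's defaults, not confirmed training values for this model:

```python
import math

def vicreg_loss(za, zb, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    """VICReg over two batches of paired embeddings (lists of equal-length vectors)."""
    n, d = len(za), len(za[0])

    # Invariance: mean squared error between paired embeddings
    sim = sum((a - b) ** 2 for va, vb in zip(za, zb)
              for a, b in zip(va, vb)) / (n * d)

    def var_cov(z):
        # Center each dimension over the batch
        means = [sum(v[j] for v in z) / n for j in range(d)]
        c = [[v[j] - means[j] for j in range(d)] for v in z]
        # Variance: hinge pushing each dimension's std above 1
        stds = [math.sqrt(sum(v[j] ** 2 for v in c) / (n - 1) + eps)
                for j in range(d)]
        var = sum(max(0.0, 1.0 - s) for s in stds) / d
        # Covariance: sum of squared off-diagonal covariance entries
        cov = sum((sum(v[j] * v[k] for v in c) / (n - 1)) ** 2
                  for j in range(d) for k in range(d) if j != k)
        return var, cov / d

    var_a, cov_a = var_cov(za)
    var_b, cov_b = var_cov(zb)
    return sim_w * sim + var_w * (var_a + var_b) + cov_w * (cov_a + cov_b)

# Toy batch of 4 paired 2-d embeddings
za = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
print(f"{vicreg_loss(za, za):.4f}")
```

With identical pairs the invariance term is zero, so any remaining loss comes purely from the variance/covariance regularizers on the batch statistics.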