---
language:
- en
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- code-retrieval
- embeddings
base_model: openai/gpt2-large
datasets:
- aysinghal/code-retrieval-training-dataset
pipeline_tag: sentence-similarity
---

# ide-code-retrieval-gpt2-large-llm2vec

A [SentenceTransformer](https://www.sbert.net/) model fine-tuned from
[openai/gpt2-large](https://huggingface.co/openai/gpt2-large) for **IDE code retrieval** --
mapping natural-language commit queries to relevant source code documents via
dense vector similarity.

> **Note:** This is an intermediate checkpoint at step 0 / 0
> (0.0% through 3 epochs). Training loss is still decreasing,
> so a later checkpoint may perform better.

## Model Description

This model encodes both short natural-language queries (commit messages, search
queries) and longer code documents into a shared embedding space. Retrieval is
performed by computing cosine similarity between the query embedding and
candidate code embeddings.

- **Base model:** [openai/gpt2-large](https://huggingface.co/openai/gpt2-large) (774M parameters)
- **Max sequence length:** 512 tokens
- **Output dimensionality:** 1024 (normalized)
- **Similarity function:** Cosine similarity

## Training Details

### Dataset

- **Source:** [aysinghal/code-retrieval-training-dataset](https://huggingface.co/datasets/aysinghal/code-retrieval-training-dataset)
- **Total pairs:** 5,032,350
- **Train split:** 4,780,732 pairs (95%)
- **Eval split:** 251,618 pairs (5%)
- **Text strategy:** truncate (max 4096 chars)
- **Negatives:** Explicit hard negatives from the dataset
- **Pre-tokenized:** Yes (token IDs stored on disk for zero-overhead data loading)

### Loss Function

[MultipleNegativesRankingLoss](https://www.sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss)
(InfoNCE) with explicit hard negatives. Each training example consists of an
anchor (query), a positive (relevant code), and a hard negative (similar but
irrelevant code). In-batch negatives provide additional contrast.
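
The objective described above can be sketched in a few lines of plain Python. This is an illustrative re-implementation, not the library's code: the function names `cosine` and `info_nce_loss` are hypothetical, and the `scale=20.0` temperature is an assumption (the card does not state the value used in training). Each anchor is scored against every positive and every hard negative in the batch, and the loss is the cross-entropy of picking its own positive.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchors, positives, hard_negatives, scale=20.0):
    """Mean cross-entropy of each anchor selecting its own positive.

    Candidates for every anchor are all positives in the batch (the
    in-batch negatives) followed by all hard negatives.
    `scale` is an assumed temperature, not a value taken from the card.
    """
    candidates = positives + hard_negatives
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, c) for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(x - max_logit) for x in logits)
        )
        # The correct candidate for anchor i is positives[i], at index i.
        total += log_sum_exp - logits[i]
    return total / len(anchors)


# Toy 2-D embeddings, purely illustrative.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
hard_negatives = [[0.5, 0.5], [0.5, 0.5]]

aligned = info_nce_loss(anchors, positives, hard_negatives)
swapped = info_nce_loss(anchors, positives[::-1], hard_negatives)
print(aligned < swapped)
```

With the positives swapped, each anchor's target is the candidate it is least similar to, so the loss rises sharply; this is the signal that pulls matching query/code pairs together during training.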

### Hyperparameters

| Parameter | Value |
|:---|:---|
| Base model | `openai/gpt2-large` |
| Learning rate | 2e-05 |
| LR schedule | Linear with warmup |
| Warmup ratio | 0.1 |
| Epochs | 3 |
| Effective batch size | 256 |
| Per-GPU batch size | 64 |
| Gradient accumulation | 1 |
| Max sequence length | 512 tokens |
| Precision | BFloat16 |
| Gradient checkpointing | True |
| torch.compile | Enabled (max-autotune) |
| Seed | 42 |
| Eval strategy | Every 915 steps |
| Early stopping patience | 3 |
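
As a rough illustration of the "Linear with warmup" schedule in the table, the sketch below ramps the learning rate from 0 to the peak over the first 10% of steps and then decays it linearly to 0. The function name `linear_warmup_lr` and the 1,000-step horizon are hypothetical; the exact rounding of the warmup length in the real trainer may differ.

```python
def linear_warmup_lr(step, total_steps, peak_lr=2e-5, warmup_ratio=0.1):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 up to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: fall linearly from peak_lr to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical 1,000-step run, for illustration only.
for step in (0, 50, 100, 550, 1000):
    print(step, linear_warmup_lr(step, total_steps=1000))
```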

### Hardware

- **GPUs:** 4x NVIDIA L40S
- **Total training steps:** 0 (3 epochs)
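
The step counters in this card are reported as 0. For orientation, the expected length of a full run can be estimated from the figures stated above. This is back-of-the-envelope arithmetic, not a logged value: it assumes the final partial batch of each epoch is kept and that early stopping does not trigger.

```python
import math

# Values taken from the Dataset, Hyperparameters, and Hardware sections.
train_pairs = 4_780_732
per_gpu_batch = 64
num_gpus = 4
grad_accum = 1
epochs = 3
warmup_ratio = 0.1
eval_every = 915

effective_batch = per_gpu_batch * num_gpus * grad_accum    # 256
steps_per_epoch = math.ceil(train_pairs / effective_batch)  # 18675
total_steps = steps_per_epoch * epochs                      # 56025
warmup_steps = int(total_steps * warmup_ratio)              # 5602
num_evals = total_steps // eval_every                       # 61

print(effective_batch, steps_per_epoch, total_steps, warmup_steps, num_evals)
```

The computed effective batch size of 256 agrees with the value in the hyperparameter table.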

### Training Progress (at checkpoint step 0)

- **Progress:** 0 / 0 steps (0.0%)

## Usage

### Loading the Model

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("aysinghal/ide-code-retrieval-gpt2-large-llm2vec")
```

### Computing Embeddings

```python
from sentence_transformers.util import cos_sim

# `model` is the SentenceTransformer loaded in the previous snippet.
queries = [
    "fix null pointer exception in user authentication",
    "add retry logic to API client",
]
code_docs = [
    "def authenticate(user):\n    if user is None:\n        raise ValueError...",
    "class APIClient:\n    def request(self, url, retries=3):\n        ...",
]

query_embeddings = model.encode(queries)
code_embeddings = model.encode(code_docs)

# Pairwise cosine similarities: rows are queries, columns are code documents.
similarities = cos_sim(query_embeddings, code_embeddings)
print(similarities)
```

## Intended Use

- **Primary use case:** Retrieving relevant code files/functions given a
  natural-language query (commit message, bug description, feature request)
- **Search pipeline:** Encode a corpus of code documents offline, then at query
  time encode the query and find nearest neighbors via cosine similarity
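
The search pipeline described above can be sketched without any dependencies. In this minimal example the vectors are hand-written stand-ins for what `model.encode` would return, and the helper names `normalize` and `search` are hypothetical. A real deployment would more likely use NumPy or a vector index, but the logic is the same: normalize the corpus once offline, then rank by dot product at query time.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def search(query_vec, corpus_index, k=2):
    """Return the top-k (doc_index, score) pairs for one query."""
    q = normalize(query_vec)
    scored = (
        (sum(a * b for a, b in zip(q, doc)), idx)
        for idx, doc in enumerate(corpus_index)
    )
    return [(idx, score) for score, idx in heapq.nlargest(k, scored)]


# Offline step: embed and normalize the corpus once (toy 3-D vectors here).
corpus_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
corpus_index = [normalize(v) for v in corpus_vectors]

# Query time: embed the query, then rank documents by cosine similarity.
results = search([1.0, 0.0, 0.0], corpus_index, k=2)
print([idx for idx, _ in results])
```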

## Limitations

- This is an **early checkpoint** (0.0% through training). The
  loss curve is still decreasing, so later checkpoints will likely perform
  better.
- Trained on a specific code retrieval dataset; may not generalize to all
  programming languages or query styles without further fine-tuning.
- Maximum input length is 512 tokens; longer files are truncated.

## Citation

If you use this model, please cite the base model:

```bibtex
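@article{radford2019language,
  title={Language Models are Unsupervised Multitask Learners},
  author={Radford, Alec and Wu, Jeffrey and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya},
  journal={OpenAI Blog},
  year={2019}
}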
```