---
language:
- en
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- code-retrieval
- embeddings
base_model: openai/gpt2-large
datasets:
- aysinghal/code-retrieval-training-dataset
pipeline_tag: sentence-similarity
---
# ide-code-retrieval-gpt2-large-llm2vec
A [SentenceTransformer](https://www.sbert.net/) model fine-tuned from
[openai/gpt2-large](https://huggingface.co/openai/gpt2-large) for **IDE code retrieval** --
mapping natural-language commit queries to relevant source code documents via
dense vector similarity.
> **Note:** This is an intermediate checkpoint at step 0 / 0
> (0.0% through 3 epochs). Training loss is still decreasing,
> so a later checkpoint may perform better.
## Model Description
This model encodes both short natural-language queries (commit messages, search
queries) and longer code documents into a shared embedding space. Retrieval is
performed by computing cosine similarity between the query embedding and
candidate code embeddings.
- **Base model:** [openai/gpt2-large](https://huggingface.co/openai/gpt2-large) (774M parameters)
- **Max sequence length:** 512 tokens
- **Output dimensionality:** 1024 (normalized)
- **Similarity function:** Cosine similarity
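As a minimal sketch of the similarity step described above: once vectors are L2-normalized, cosine similarity is just a dot product. The 4-dimensional vectors below are made-up stand-ins rather than real model outputs, and `cosine_similarity` is a hypothetical helper name, not part of the model's API.

```python
import numpy as np


def cosine_similarity(queries: np.ndarray, docs: np.ndarray) -> np.ndarray:
    """Return the (num_queries, num_docs) matrix of cosine similarities."""
    # Normalize each row to unit length; harmless if rows are already normalized.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return q @ d.T


# Toy stand-ins for one query embedding and two code embeddings.
query = np.array([[1.0, 0.0, 1.0, 0.0]])
code = np.array([
    [2.0, 0.0, 2.0, 0.0],  # same direction as the query
    [0.0, 3.0, 0.0, 0.0],  # orthogonal to the query
])

scores = cosine_similarity(query, code)
print(scores)
```

The first document points in the same direction as the query and scores close to 1; the orthogonal one scores close to 0, regardless of vector length.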
## Training Details
### Dataset
- **Source:** [aysinghal/code-retrieval-training-dataset](https://huggingface.co/datasets/aysinghal/code-retrieval-training-dataset)
- **Total pairs:** 5,032,350
- **Train split:** 4,780,732 pairs (95%)
- **Eval split:** 251,618 pairs (5%)
- **Text strategy:** truncate (max 4096 chars)
- **Negatives:** Explicit hard negatives from the dataset
- **Pre-tokenized:** Yes (token IDs stored on disk for zero-overhead data loading)
### Loss Function
[MultipleNegativesRankingLoss](https://www.sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss)
(InfoNCE) with explicit hard negatives. Each training example consists of an
anchor (query), a positive (relevant code), and a hard negative (similar but
irrelevant code). In-batch negatives provide additional contrast.
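The objective described above can be sketched in NumPy as follows. This is an illustration of the general InfoNCE form with hard negatives, not the library's implementation; the function name and the `scale=20.0` temperature are assumptions (the card does not state the scale used in training).

```python
import numpy as np


def info_nce_with_hard_negatives(anchors, positives, negatives, scale=20.0):
    """Mean cross-entropy over anchors; the target for anchor i is positive i."""
    def normalize(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    a, p, n = normalize(anchors), normalize(positives), normalize(negatives)
    # Candidates for every anchor: all positives in the batch, then all hard negatives.
    candidates = np.concatenate([p, n], axis=0)   # shape (2B, D)
    logits = scale * (a @ candidates.T)           # shape (B, 2B)
    # Numerically stable log-softmax over the candidate axis.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    targets = np.arange(len(a))                   # positive i sits in column i
    return float(-log_probs[targets, targets].mean())


# Toy batch of 4: anchors and positives share directions, negatives are orthogonal.
anchors = np.eye(4, 8)
hard_negatives = np.eye(4, 8, k=4)
aligned_loss = info_nce_with_hard_negatives(anchors, anchors, hard_negatives)
mismatched_loss = info_nce_with_hard_negatives(anchors, anchors[::-1], hard_negatives)
print(aligned_loss, mismatched_loss)
```

When each anchor matches its own positive the loss is near zero; when the positives are shuffled so that another row's positive scores higher, the loss grows sharply.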
### Hyperparameters
| Parameter | Value |
|:---|:---|
| Base model | `openai/gpt2-large` |
| Learning rate | 2e-05 |
| LR schedule | Linear with warmup |
| Warmup ratio | 0.1 |
| Epochs | 3 |
| Effective batch size | 256 |
| Per-GPU batch size | 64 |
| Gradient accumulation | 1 |
| Max sequence length | 512 tokens |
| Precision | BFloat16 |
| Gradient checkpointing | True |
| torch.compile | Enabled (max-autotune) |
| Seed | 42 |
| Eval strategy | Every 915 steps |
| Early stopping patience | 3 |
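For intuition about the "Linear with warmup" row, here is a small sketch of that schedule shape using the learning rate and warmup ratio from the table. The helper name `linear_warmup_lr` and the 1,000-step run length are hypothetical; the exact rounding of the warmup step count in the training framework may differ.

```python
def linear_warmup_lr(step: int, total_steps: int,
                     peak_lr: float = 2e-5, warmup_ratio: float = 0.1) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp up linearly from 0 to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Decay linearly from peak_lr at the end of warmup to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical run length, chosen only to make the shape easy to read.
total = 1000
schedule = [linear_warmup_lr(s, total) for s in (0, 50, 100, 550, 1000)]
print(schedule)
```

The rate climbs to 2e-05 over the first 10% of steps, then falls linearly back to zero by the final step.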
### Hardware
- **GPUs:** 4x NVIDIA L40S
- **Total training steps:** 0 (3 epochs)
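The step count listed above is 0, so as a rough cross-check the expected number of optimizer steps can be estimated from the figures stated elsewhere in this card. This is plain arithmetic under the assumption that the final partial batch of each epoch is kept and that early stopping does not trigger; it is not a reported value.

```python
import math

train_pairs = 4_780_732  # train split size from the Dataset section
num_gpus = 4             # 4x NVIDIA L40S
per_gpu_batch = 64
grad_accum = 1
epochs = 3

effective_batch = num_gpus * per_gpu_batch * grad_accum
# Assumes the last partial batch of each epoch is kept (no drop_last).
steps_per_epoch = math.ceil(train_pairs / effective_batch)
total_steps = steps_per_epoch * epochs

print(effective_batch, steps_per_epoch, total_steps)  # 256 18675 56025
```

The effective batch size of 256 agrees with the hyperparameter table (4 GPUs x 64 per GPU x 1 accumulation step).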
### Training Progress (at checkpoint step 0)
- **Progress:** 0 / 0 steps (0.0%)
<details>
<summary>Full training loss history (click to expand)</summary>
</details>
## Usage
### Loading the Model
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("aysinghal/ide-code-retrieval-gpt2-large-llm2vec")
```
### Computing Embeddings
```python
from sentence_transformers.util import cos_sim

# `model` is the SentenceTransformer instance loaded in the previous snippet.

# Natural-language queries (e.g. commit messages)
queries = [
    "fix null pointer exception in user authentication",
    "add retry logic to API client",
]
# Candidate code documents
code_docs = [
    "def authenticate(user):\n    if user is None:\n        raise ValueError...",
    "class APIClient:\n    def request(self, url, retries=3):\n        ...",
]

query_embeddings = model.encode(queries)
code_embeddings = model.encode(code_docs)

# Cosine similarity matrix of shape (len(queries), len(code_docs))
similarities = cos_sim(query_embeddings, code_embeddings)
print(similarities)
```
## Intended Use
- **Primary use case:** Retrieving relevant code files/functions given a
natural-language query (commit message, bug description, feature request)
- **Search pipeline:** Encode a corpus of code documents offline, then at query
time encode the query and find nearest neighbors via cosine similarity
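The search pipeline above can be sketched with NumPy alone. The toy 3-dimensional vectors stand in for `model.encode(...)` outputs, and `build_index` and `search` are hypothetical helper names used only for illustration.

```python
import numpy as np


def build_index(doc_embeddings: np.ndarray) -> np.ndarray:
    """Offline step: L2-normalize the corpus embeddings once and keep them."""
    return doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)


def search(index: np.ndarray, query_embedding: np.ndarray, top_k: int = 2):
    """Query-time step: return (doc_id, cosine score) pairs, best first."""
    q = query_embedding / np.linalg.norm(query_embedding)
    scores = index @ q                  # cosine similarity per document
    best = np.argsort(-scores)[:top_k]  # exhaustive search, fine for small corpora
    return [(int(i), float(scores[i])) for i in best]


# Toy vectors standing in for encoded code documents and one encoded query.
corpus = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
])
index = build_index(corpus)
hits = search(index, np.array([1.0, 0.2, 0.0]), top_k=2)
print(hits)
```

Normalizing the corpus once offline keeps the query-time work to a single matrix-vector product. For large corpora, an approximate nearest-neighbor index would typically replace the exhaustive sort.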
## Limitations
- This is an **early checkpoint** (0.0% through training). The
loss curve is still decreasing, so later checkpoints will likely perform
better.
- Trained on a specific code retrieval dataset; may not generalize to all
programming languages or query styles without further fine-tuning.
- Max context is 512 tokens -- very long files are truncated.
## Citation
If you use this model, please cite the base model:
```bibtex
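@article{radford2019language,
  title={Language Models are Unsupervised Multitask Learners},
  author={Radford, Alec and Wu, Jeff and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya},
  year={2019}
}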
```