Final model after 9150 steps
---
language:
- en
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- code-retrieval
- embeddings
base_model: openai/gpt2-large
datasets:
- aysinghal/code-retrieval-training-dataset
pipeline_tag: sentence-similarity
---
# ide-code-retrieval-gpt2-large-llm2vec
A [SentenceTransformer](https://www.sbert.net/) model fine-tuned from
[openai/gpt2-large](https://huggingface.co/openai/gpt2-large) for **IDE code retrieval** --
mapping natural-language commit queries to relevant source code documents via
dense vector similarity.
> **Note:** This is an intermediate checkpoint at step 0 / 0
> (0.0% through 3 epochs). Training loss is still decreasing,
> so a later checkpoint may perform better.
## Model Description
This model encodes both short natural-language queries (commit messages, search
queries) and longer code documents into a shared embedding space. Retrieval is
performed by computing cosine similarity between the query embedding and
candidate code embeddings.
- **Base model:** [openai/gpt2-large](https://huggingface.co/openai/gpt2-large) (774M parameters)
- **Max sequence length:** 512 tokens
- **Output dimensionality:** 1024 (normalized)
- **Similarity function:** Cosine similarity
## Training Details
### Dataset
- **Source:** [aysinghal/code-retrieval-training-dataset](https://huggingface.co/datasets/aysinghal/code-retrieval-training-dataset)
- **Total pairs:** 5,032,350
- **Train split:** 4,780,732 pairs (95%)
- **Eval split:** 251,618 pairs (5%)
- **Text strategy:** truncate (max 4096 chars)
- **Negatives:** Explicit hard negatives from the dataset
- **Pre-tokenized:** Yes (token IDs stored on disk for zero-overhead data loading)
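The "truncate" text strategy can be illustrated with a minimal sketch. The field names `anchor`, `positive`, and `negative` and the helper `truncate_text` are hypothetical; the dataset's actual column names and preprocessing code are not shown in this card.

```python
MAX_CHARS = 4096  # character cap listed under "Text strategy"


def truncate_text(text, max_chars=MAX_CHARS):
    """Keep only the first `max_chars` characters of a string."""
    return text[:max_chars]


# Hypothetical triplet: a query, the relevant code, and a hard negative.
raw_record = {
    "anchor": "fix null pointer exception in user authentication",
    "positive": "x = 1\n" * 2000,  # 12,000 characters, longer than the cap
    "negative": "def unrelated():\n    pass\n",
}

# Apply the same cap to every text field of the triplet.
record = {field: truncate_text(text) for field, text in raw_record.items()}
```

Short fields pass through unchanged; only the oversized `positive` is cut. Token-level truncation to 512 tokens would then happen during tokenization.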
### Loss Function
[MultipleNegativesRankingLoss](https://www.sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss)
(InfoNCE) with explicit hard negatives. Each training example consists of an
anchor (query), a positive (relevant code), and a hard negative (similar but
irrelevant code). In-batch negatives provide additional contrast.
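As a rough illustration of this objective, the sketch below computes an InfoNCE-style loss over a toy batch in plain Python. It is not the training code: the function names are hypothetical, and `scale=20.0` is an assumption taken from the sentence-transformers default rather than a value stated in this card.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchors, positives, hard_negatives, scale=20.0):
    """Mean cross-entropy where anchor i must pick positive i out of all
    positives (in-batch negatives) plus all hard negatives in the batch."""
    candidates = positives + hard_negatives
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, c) for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(x - max_logit) for x in logits)
        )
        total += log_sum_exp - logits[i]
    return total / len(anchors)


# Toy 2-D embeddings standing in for encoder outputs.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
hard_negatives = [[0.6, 0.4], [0.4, 0.6]]

loss_good = info_nce_loss(anchors, positives, hard_negatives)
# Swapping the roles makes the "correct" candidate the less similar one.
loss_bad = info_nce_loss(anchors, hard_negatives, positives)
```

The loss is small when each anchor is closest to its own positive and grows when a hard negative or another example's positive scores higher.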
### Hyperparameters
| Parameter | Value |
|:---|:---|
| Base model | `openai/gpt2-large` |
| Learning rate | 2e-05 |
| LR schedule | Linear with warmup |
| Warmup ratio | 0.1 |
| Epochs | 3 |
| Effective batch size | 256 |
| Per-GPU batch size | 64 |
| Gradient accumulation | 1 |
| Max sequence length | 512 tokens |
| Precision | BFloat16 |
| Gradient checkpointing | True |
| torch.compile | Enabled (max-autotune) |
| Seed | 42 |
| Eval strategy | Every 915 steps |
| Early stopping patience | 3 |
### Hardware
- **GPUs:** 4x NVIDIA L40S
- **Total training steps:** 0 (3 epochs)
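For orientation, the batch-size and step arithmetic implied by the listed settings can be sketched as follows. This is an estimate, assuming plain data parallelism across the 4 GPUs and that the last partial batch of each epoch is kept; the actual trainer may round differently.

```python
import math

# Values copied from the tables above.
per_gpu_batch_size = 64
num_gpus = 4
grad_accum_steps = 1
train_pairs = 4_780_732
epochs = 3
eval_every = 915

# One optimizer step consumes a batch from every GPU.
effective_batch_size = per_gpu_batch_size * num_gpus * grad_accum_steps

# Assumption: the final partial batch is kept, hence the ceiling.
steps_per_epoch = math.ceil(train_pairs / effective_batch_size)
estimated_total_steps = steps_per_epoch * epochs
evals_per_epoch = steps_per_epoch // eval_every

print(effective_batch_size, steps_per_epoch, estimated_total_steps, evals_per_epoch)
```

Under these assumptions the effective batch size matches the 256 listed in the hyperparameter table.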
### Training Progress (at checkpoint step 0)
- **Progress:** 0 / 0 steps (0.0%)
## Usage
### Loading the Model
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("aysinghal/ide-code-retrieval-gpt2-large-llm2vec")
```
### Computing Embeddings
```python
# Assumes `model` was loaded as shown in "Loading the Model".
from sentence_transformers.util import cos_sim

queries = [
    "fix null pointer exception in user authentication",
    "add retry logic to API client",
]
code_docs = [
    "def authenticate(user):\n    if user is None:\n        raise ValueError...",
    "class APIClient:\n    def request(self, url, retries=3):\n        ...",
]

query_embeddings = model.encode(queries)
code_embeddings = model.encode(code_docs)

# Rows index queries, columns index code documents.
similarities = cos_sim(query_embeddings, code_embeddings)
print(similarities)
```
## Intended Use
- **Primary use case:** Retrieving relevant code files/functions given a
natural-language query (commit message, bug description, feature request)
- **Search pipeline:** Encode a corpus of code documents offline, then at query
time encode the query and find nearest neighbors via cosine similarity
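The offline-index, online-query flow described above can be sketched with a brute-force search. The functions `build_index` and `search` are hypothetical helpers, and the toy 3-D vectors stand in for `model.encode(...)` outputs; a real deployment over a large corpus would more likely use an approximate nearest-neighbor library.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def build_index(doc_embeddings):
    """Offline step: normalize every document embedding once."""
    return [normalize(v) for v in doc_embeddings]


def search(index, query_embedding, top_k=2):
    """Query step: return (doc_id, cosine score) pairs, best match first."""
    q = normalize(query_embedding)
    scored = [
        (sum(a * b for a, b in zip(q, doc)), doc_id)
        for doc_id, doc in enumerate(index)
    ]
    best = heapq.nlargest(top_k, scored)
    return [(doc_id, score) for score, doc_id in best]


# Toy embeddings standing in for encoded code documents and one query.
index = build_index([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]])
hits = search(index, [0.9, 0.1, 0.0], top_k=2)
```

Normalizing once at index time keeps the per-query cost to one dot product per document.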
## Limitations
- This is an **early checkpoint** (0.0% through training). The
loss curve is still decreasing, so later checkpoints will likely perform
better.
- Trained on a specific code retrieval dataset; may not generalize to all
programming languages or query styles without further fine-tuning.
- Inputs are capped at 512 tokens, so very long files are truncated and
anything past the cap is not embedded.
## Citation
If you use this model, please cite the base model:
```bibtex
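@article{radford2019language,
title={Language Models are Unsupervised Multitask Learners},
author={Radford, Alec and Wu, Jeffrey and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya},
journal={OpenAI blog},
year={2019}
}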
```