Final model after 9150 steps
metadata
language:
  - en
license: apache-2.0
library_name: sentence-transformers
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - code-retrieval
  - embeddings
base_model: openai/gpt2-large
datasets:
  - aysinghal/code-retrieval-training-dataset
pipeline_tag: sentence-similarity

ide-code-retrieval-gpt2-large-llm2vec

A SentenceTransformer model fine-tuned from openai/gpt2-large for IDE code retrieval -- mapping natural-language commit queries to relevant source code documents via dense vector similarity.

Note: This is an intermediate checkpoint at step 0 / 0 (0.0% through 3 epochs). Training loss is still decreasing, so a later checkpoint may perform better.

Model Description

This model encodes both short natural-language queries (commit messages, search queries) and longer code documents into a shared embedding space. Retrieval is performed by computing cosine similarity between the query embedding and candidate code embeddings.

  • Base model: openai/gpt2-large (0.6B parameters)
  • Max sequence length: 512 tokens
  • Output dimensionality: 1024 (normalized)
  • Similarity function: Cosine similarity
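Because the output embeddings are normalized, cosine similarity between two of them reduces to a plain dot product. The following is a minimal pure-Python sketch of that relationship; the three-dimensional vectors are made-up stand-ins for the model's 1024-dimensional outputs, and the helper names are illustrative only.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed from scratch, for comparison."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors (hypothetical values, not real model outputs).
raw_query = [3.0, 4.0, 0.0]
raw_code = [4.0, 3.0, 0.0]

query = l2_normalize(raw_query)
code = l2_normalize(raw_code)

# For unit-length vectors the dot product and the cosine agree.
score = dot(query, code)
reference = cosine(raw_query, raw_code)
```

In practice this means a nearest-neighbor index configured for inner-product search should rank candidates the same way cosine similarity does.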

Training Details

Dataset

  • Source: aysinghal/code-retrieval-training-dataset
  • Total pairs: 5,032,350
  • Train split: 4,780,732 pairs (95%)
  • Eval split: 251,618 pairs (5%)
  • Text strategy: truncate (max 4096 chars)
  • Negatives: Explicit hard negatives from the dataset
  • Pre-tokenized: Yes (token IDs stored on disk for zero-overhead data loading)
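The split sizes and the character cap above can be reproduced with a few lines of arithmetic. This is a sketch under assumptions: the card does not say which end of an over-long text is kept, so the helper below assumes the leading 4096 characters survive, and `truncate_text` is a hypothetical name rather than a function from the training code.

```python
MAX_CHARS = 4096  # character cap from the "Text strategy" entry

def truncate_text(text, max_chars=MAX_CHARS):
    """Keep at most max_chars characters (assumed: the leading ones)."""
    return text[:max_chars]

total_pairs = 5_032_350
train_pairs = total_pairs * 95 // 100   # integer 95% share
eval_pairs = total_pairs - train_pairs  # the remainder goes to eval

long_doc = "x" * 10_000          # stand-in for a very long source file
short_doc = "def f(): pass"      # short inputs pass through unchanged

print(train_pairs, eval_pairs)
print(len(truncate_text(long_doc)), truncate_text(short_doc) == short_doc)
```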

Loss Function

MultipleNegativesRankingLoss (InfoNCE) with explicit hard negatives. Each training example consists of an anchor (query), a positive (relevant code), and a hard negative (similar but irrelevant code). In-batch negatives provide additional contrast.
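To make the objective concrete, here is a small pure-Python sketch of an InfoNCE-style loss over a similarity matrix. It is not the training code: the similarity values are invented, and the scale of 20 is the default used by sentence-transformers' MultipleNegativesRankingLoss, which the card does not confirm was left unchanged.

```python
import math

def info_nce_loss(sim_rows, scale=20.0):
    """Mean cross-entropy where row i's correct candidate is column i.

    sim_rows[i][j] is the cosine similarity between anchor i and candidate j.
    Columns 0..B-1 are the batch positives (in-batch negatives for the other
    anchors); any further columns are explicit hard negatives.
    """
    total = 0.0
    for i, row in enumerate(sim_rows):
        logits = [scale * s for s in row]
        peak = max(logits)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]
    return total / len(sim_rows)

# Two anchors; candidates are [pos_0, pos_1, hardneg_0, hardneg_1].
# All similarity values are hypothetical.
sims_with_hard_negatives = [
    [0.9, 0.1, 0.6, 0.0],
    [0.2, 0.8, 0.1, 0.7],
]
# Same batch with only in-batch negatives.
sims_in_batch_only = [row[:2] for row in sims_with_hard_negatives]

loss_hard = info_nce_loss(sims_with_hard_negatives)
loss_easy = info_nce_loss(sims_in_batch_only)
```

Adding hard-negative columns can only raise the loss for the same anchors, which is why they give the model a stronger training signal than in-batch negatives alone.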

Hyperparameters

| Parameter | Value |
| --- | --- |
| Base model | openai/gpt2-large |
| Learning rate | 2e-05 |
| LR schedule | Linear with warmup |
| Warmup ratio | 0.1 |
| Epochs | 3 |
| Effective batch size | 256 |
| Per-GPU batch size | 64 |
| Gradient accumulation | 1 |
| Max sequence length | 512 tokens |
| Precision | BFloat16 |
| Gradient checkpointing | True |
| torch.compile | Enabled (max-autotune) |
| Seed | 42 |
| Eval strategy | Every 915 steps |
| Early stopping patience | 3 |
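The "Linear with warmup" schedule can be sketched as a function of the step index. This is an illustrative re-implementation of the usual linear warmup followed by linear decay, not code from the training run; `linear_warmup_lr` is a hypothetical name and the `total_steps=1000` used below is a made-up value chosen for readability.

```python
def linear_warmup_lr(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Learning rate at a given step: linear ramp up, then linear decay to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: grow linearly from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: shrink linearly from base_lr to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Hypothetical run length, for illustration only.
total_steps = 1000
lr_start = linear_warmup_lr(0, total_steps)      # start of warmup
lr_peak = linear_warmup_lr(100, total_steps)     # end of warmup
lr_mid = linear_warmup_lr(550, total_steps)      # halfway through decay
lr_end = linear_warmup_lr(1000, total_steps)     # final step
```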

Hardware

  • GPUs: 4x NVIDIA L40S
  • Total training steps: 0 (3 epochs)
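The effective batch size follows from the per-GPU batch size, the GPU count, and gradient accumulation. The sketch below works through that arithmetic and, assuming each training pair is consumed once per epoch and the last partial batch is kept, estimates how many optimizer steps a full 3-epoch run would take. It is an estimate derived from the numbers stated in this card, not a logged value.

```python
import math

train_pairs = 4_780_732   # train split size from the Dataset section
per_gpu_batch = 64
num_gpus = 4
grad_accum = 1
epochs = 3

# Samples contributing to one optimizer step.
effective_batch = per_gpu_batch * num_gpus * grad_accum

# Assumption: the final partial batch of each epoch is kept, hence ceil.
steps_per_epoch = math.ceil(train_pairs / effective_batch)
total_steps = steps_per_epoch * epochs

print(effective_batch, steps_per_epoch, total_steps)
```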

Training Progress (at checkpoint step 0)

  • Progress: 0 / 0 steps (0.0%)

Usage

Loading the Model

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("aysinghal/ide-code-retrieval-gpt2-large-llm2vec")
```

Computing Embeddings

```python
from sentence_transformers.util import cos_sim

# `model` is the SentenceTransformer loaded in the previous snippet.
queries = [
    "fix null pointer exception in user authentication",
    "add retry logic to API client",
]
code_docs = [
    "def authenticate(user):\n    if user is None:\n        raise ValueError...",
    "class APIClient:\n    def request(self, url, retries=3):\n        ...",
]

query_embeddings = model.encode(queries)
code_embeddings = model.encode(code_docs)

# Compute cosine similarities: one row per query, one column per code document
similarities = cos_sim(query_embeddings, code_embeddings)
print(similarities)
```

Intended Use

  • Primary use case: Retrieving relevant code files/functions given a natural-language query (commit message, bug description, feature request)
  • Search pipeline: Encode a corpus of code documents offline, then at query time encode the query and find nearest neighbors via cosine similarity
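The search pipeline described above can be sketched as a small brute-force index: embed the corpus once, then score each query against the stored vectors. Everything here is illustrative. `CodeIndex` and `toy_encode` are hypothetical names, and the toy encoder counts keywords so the sketch runs without downloading the model; it does not produce real embeddings.

```python
import math

def _unit(vec):
    """Return vec scaled to unit length; zero vectors stay as they are."""
    values = [float(x) for x in vec]
    norm = math.sqrt(sum(x * x for x in values))
    return [x / norm for x in values] if norm else values

class CodeIndex:
    """Brute-force cosine-similarity index over a list of code documents."""

    def __init__(self, encode, docs):
        self.encode = encode   # callable: list[str] -> one vector per text
        self.docs = list(docs)
        # Offline step: embed the whole corpus once.
        self.vectors = [_unit(v) for v in encode(self.docs)]

    def search(self, query, top_k=3):
        """Return up to top_k (doc_index, score) pairs, best match first."""
        q = _unit(self.encode([query])[0])
        scores = [sum(a * b for a, b in zip(q, v)) for v in self.vectors]
        ranked = sorted(enumerate(scores), key=lambda p: p[1], reverse=True)
        return ranked[:top_k]

# Hypothetical stand-in encoder: keyword counts instead of embeddings.
KEYWORDS = ["retry", "auth", "parse"]

def toy_encode(texts):
    return [[t.lower().count(k) for k in KEYWORDS] for t in texts]

docs = [
    "def authenticate(user): ...  # auth check",
    "class APIClient:  # request with retry",
    "def parse_config(path): ...",
]
index = CodeIndex(toy_encode, docs)
hits = index.search("add retry logic to API client", top_k=2)
print(hits)
```

With the real model, passing `model.encode` in place of `toy_encode` should work, since it accepts a list of strings and returns one vector per input. For large corpora an approximate nearest-neighbor library would likely replace the linear scan.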

Limitations

  • This is an early checkpoint (0.0% through training). The loss curve is still decreasing, so later checkpoints will likely perform better.
  • Trained on a specific code retrieval dataset; may not generalize to all programming languages or query styles without further fine-tuning.
  • Max context is 512 tokens -- longer inputs are truncated, so only the beginning of a very long file contributes to its embedding.

Citation

If you use this model, please cite the base model:

@article{qwen3embedding,
  title={Qwen3-Embedding},
  author={Qwen Team},
  year={2025}
}