language:
- en
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- code-retrieval
- embeddings
base_model: openai/gpt2-large
datasets:
- aysinghal/code-retrieval-training-dataset
pipeline_tag: sentence-similarity
ide-code-retrieval-gpt2-large-llm2vec
A SentenceTransformer model fine-tuned from openai/gpt2-large for IDE code retrieval -- mapping natural-language commit queries to relevant source code documents via dense vector similarity.
Note: This is an intermediate checkpoint at step 0 / 0 (0.0% through 3 epochs). Training loss is still decreasing, so a later checkpoint may perform better.
Model Description
This model encodes both short natural-language queries (commit messages, search queries) and longer code documents into a shared embedding space. Retrieval is performed by computing cosine similarity between the query embedding and candidate code embeddings.
- Base model: openai/gpt2-large (approximately 774M parameters)
- Max sequence length: 512 tokens
- Output dimensionality: 1024 (normalized)
- Similarity function: Cosine similarity
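Because the embeddings are listed as normalized, cosine similarity between a query and a code document reduces to a plain dot product. A minimal sketch of that equivalence, using toy 3-dimensional vectors in place of the model's 1024-dimensional output (the helper names `l2_normalize` and `cosine_similarity` are illustrative and not part of any library):

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    """Cosine similarity computed directly from its definition."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for a query embedding and a code embedding.
query = l2_normalize([1.0, 2.0, 2.0])
code = l2_normalize([2.0, 1.0, 2.0])

# For unit-length vectors the dot product already equals the cosine.
dot_product = sum(q * c for q, c in zip(query, code))
```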
Training Details
Dataset
- Source: aysinghal/code-retrieval-training-dataset
- Total pairs: 5,032,350
- Train split: 4,780,732 pairs (95%)
- Eval split: 251,618 pairs (5%)
- Text strategy: truncate (max 4096 chars)
- Negatives: Explicit hard negatives from the dataset
- Pre-tokenized: Yes (token IDs stored on disk for zero-overhead data loading)
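The truncate text strategy can be pictured as a character-level cut applied before tokenization. A minimal sketch, assuming a simple prefix cut at 4096 characters (the actual preprocessing code is not shown in this card, and `truncate_text` is a hypothetical helper):

```python
MAX_CHARS = 4096  # character cap listed under "Text strategy"

def truncate_text(text, max_chars=MAX_CHARS):
    """Keep at most max_chars leading characters of a document."""
    return text[:max_chars]

long_doc = "x = 1\n" * 2000          # 12,000 characters of toy source code
short_doc = "def f():\n    return 1\n"

truncated = truncate_text(long_doc)   # cut down to the cap
unchanged = truncate_text(short_doc)  # already under the cap
```

The 512-token limit is applied separately by the tokenizer, so a document that survives the character cut may still be shortened further.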
Loss Function
MultipleNegativesRankingLoss (InfoNCE) with explicit hard negatives. Each training example consists of an anchor (query), a positive (relevant code), and a hard negative (similar but irrelevant code). In-batch negatives provide additional contrast.
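The objective described above can be sketched in plain Python. This is an illustration of the InfoNCE idea under stated assumptions, not the training code: the similarity scale of 20.0 mirrors the sentence-transformers default and is not confirmed by this card, and the toy 2-dimensional vectors stand in for real embeddings.

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mnr_loss(anchors, positives, hard_negatives, scale=20.0):
    """InfoNCE over in-batch positives plus explicit hard negatives.

    For anchor i the correct candidate is positives[i]; every other
    positive and every hard negative in the batch acts as a negative.
    """
    candidates = positives + hard_negatives
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, c) for c in candidates]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]  # cross-entropy with label i
    return total / len(anchors)

anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
hard_negatives = [[0.5, 0.5], [0.5, 0.5]]

loss_aligned = mnr_loss(anchors, positives, hard_negatives)
# Swapping the positives pairs each anchor with the wrong document.
loss_misaligned = mnr_loss(anchors, positives[::-1], hard_negatives)
```

The loss is small when each anchor is closest to its own positive and grows sharply when the pairing is wrong, which is the signal that pulls queries toward relevant code and away from hard negatives.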
Hyperparameters
| Parameter | Value |
|---|---|
| Base model | openai/gpt2-large |
| Learning rate | 2e-05 |
| LR schedule | Linear with warmup |
| Warmup ratio | 0.1 |
| Epochs | 3 |
| Effective batch size | 256 |
| Per-GPU batch size | 64 |
| Gradient accumulation | 1 |
| Max sequence length | 512 tokens |
| Precision | BFloat16 |
| Gradient checkpointing | True |
| torch.compile | Enabled (max-autotune) |
| Seed | 42 |
| Eval strategy | Every 915 steps |
| Early stopping patience | 3 |
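The learning rate of 2e-05 with a linear schedule and a warmup ratio of 0.1 can be sketched as a function of the step index. This is a minimal sketch assuming the common shape of linear warmup followed by linear decay to zero; `linear_warmup_lr` is a hypothetical helper, and the total of 1000 steps is an illustrative value rather than a figure from this run.

```python
def linear_warmup_lr(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Linear ramp from 0 to base_lr, then linear decay back to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

total = 1000  # illustrative step count, not taken from the training run

lr_start = linear_warmup_lr(0, total)     # first step of warmup
lr_peak = linear_warmup_lr(100, total)    # end of warmup (10% of steps)
lr_mid = linear_warmup_lr(550, total)     # halfway through the decay
lr_end = linear_warmup_lr(1000, total)    # final step
```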
Hardware
- GPUs: 4x NVIDIA L40S
- Total training steps: 0 (3 epochs)
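The effective batch size follows from the per-GPU batch size, the GPU count, and the gradient accumulation setting. A minimal sketch of that arithmetic, which also estimates optimizer steps from the train split size. The step figures are estimates derived from the listed numbers, not logged values, and they assume the last partial batch of each epoch is kept.

```python
import math

train_pairs = 4_780_732   # train split size from the Dataset section
per_gpu_batch = 64
num_gpus = 4              # 4x NVIDIA L40S
grad_accum = 1
epochs = 3

effective_batch = per_gpu_batch * num_gpus * grad_accum

# Assumption: the last partial batch of each epoch is kept, hence ceil.
steps_per_epoch = math.ceil(train_pairs / effective_batch)
estimated_total_steps = steps_per_epoch * epochs

print(effective_batch, steps_per_epoch, estimated_total_steps)
```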
Training Progress (at checkpoint step 0)
- Progress: 0 / 0 steps (0.0%)
Usage
Loading the Model
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("aysinghal/ide-code-retrieval-gpt2-large-llm2vec")
Computing Embeddings
from sentence_transformers.util import cos_sim

# Natural-language queries (commit messages, bug descriptions)
queries = [
    "fix null pointer exception in user authentication",
    "add retry logic to API client",
]
# Candidate code documents to rank against each query
code_docs = [
    "def authenticate(user):\n    if user is None:\n        raise ValueError...",
    "class APIClient:\n    def request(self, url, retries=3):\n        ...",
]
query_embeddings = model.encode(queries)
code_embeddings = model.encode(code_docs)
# Compute cosine similarities; rows are queries, columns are code documents
similarities = cos_sim(query_embeddings, code_embeddings)
print(similarities)
Intended Use
- Primary use case: Retrieving relevant code files/functions given a natural-language query (commit message, bug description, feature request)
- Search pipeline: Encode a corpus of code documents offline, then at query time encode the query and find nearest neighbors via cosine similarity
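The search pipeline above can be sketched as a small brute-force index. This is a minimal sketch under stated assumptions: `CodeIndex` is a hypothetical class, and the toy 2-dimensional unit vectors stand in for embeddings that would in practice come from `model.encode`. A large corpus would likely call for an approximate nearest-neighbor library instead of a linear scan.

```python
import heapq

class CodeIndex:
    """Hypothetical in-memory index over precomputed, unit-length embeddings."""

    def __init__(self):
        self.doc_ids = []
        self.vectors = []

    def add(self, doc_id, vector):
        """Offline step: store one document embedding."""
        self.doc_ids.append(doc_id)
        self.vectors.append(vector)

    def search(self, query_vector, top_k=2):
        """Query-time step: return the top_k (score, doc_id) pairs."""
        # Dot product equals cosine similarity for unit-length vectors.
        scored = (
            (sum(q * d for q, d in zip(query_vector, vec)), doc_id)
            for doc_id, vec in zip(self.doc_ids, self.vectors)
        )
        return heapq.nlargest(top_k, scored)

index = CodeIndex()
index.add("auth.py", [1.0, 0.0])
index.add("api_client.py", [0.0, 1.0])
index.add("utils.py", [0.6, 0.8])

# Toy query embedding that points in the same direction as api_client.py.
results = index.search([0.0, 1.0], top_k=2)
```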
Limitations
- This is an early checkpoint (0.0% through training). The loss curve is still decreasing, so later checkpoints will likely perform better.
- Trained on a specific code retrieval dataset; may not generalize to all programming languages or query styles without further fine-tuning.
- Maximum context is 512 tokens; any input longer than that is truncated, so the tail of a long file does not contribute to its embedding.
Citation
If you use this model, please cite the base model:
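@article{radford2019language,
title={Language Models are Unsupervised Multitask Learners},
author={Radford, Alec and Wu, Jeff and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya},
year={2019}
}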