| language: | |
| - en | |
| license: apache-2.0 | |
| library_name: sentence-transformers | |
| tags: | |
| - sentence-transformers | |
| - sentence-similarity | |
| - feature-extraction | |
| - code-retrieval | |
| - embeddings | |
| base_model: openai/gpt2-large | |
| datasets: | |
| - aysinghal/code-retrieval-training-dataset | |
| pipeline_tag: sentence-similarity | |
| # ide-code-retrieval-gpt2-large-llm2vec | |
| A [SentenceTransformer](https://www.sbert.net/) model fine-tuned from | |
| [openai/gpt2-large](https://huggingface.co/openai/gpt2-large) for **IDE code retrieval** -- | |
| mapping natural-language commit queries to relevant source code documents via | |
| dense vector similarity. | |
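| > **Note:** This is an intermediate checkpoint. The recorded progress is step 0 / 0 | |
| > (0.0% through 3 epochs), which suggests the progress fields were not filled in | |
| > when this card was generated. Training loss was reported as still decreasing, | |
| > so a later checkpoint may perform better. | |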
| ## Model Description | |
| This model encodes both short natural-language queries (commit messages, search | |
| queries) and longer code documents into a shared embedding space. Retrieval is | |
| performed by computing cosine similarity between the query embedding and | |
| candidate code embeddings. | |
| - **Base model:** [openai/gpt2-large](https://huggingface.co/openai/gpt2-large) (774M parameters) | |
| - **Max sequence length:** 512 tokens | |
| - **Output dimensionality:** 1024 (normalized) | |
| - **Similarity function:** Cosine similarity | |
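Because the embeddings are listed as normalized, cosine similarity between two of them reduces to a plain dot product. The sketch below illustrates this with small hand-written vectors standing in for model outputs; the helper names (`l2_normalize`, `cosine_similarity`) are illustrative and not part of the model's API.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_similarity(u, v):
    """Cosine similarity computed from scratch, without assuming unit length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional stand-ins for a query embedding and a code embedding
query_vec = l2_normalize([3.0, 4.0, 0.0])
code_vec = l2_normalize([4.0, 3.0, 0.0])

# For unit-length vectors the dot product equals the cosine similarity
dot_product = sum(a * b for a, b in zip(query_vec, code_vec))
print(dot_product, cosine_similarity(query_vec, code_vec))
```

In practice this means a vector index configured for inner-product search should return the same ranking as one configured for cosine similarity, provided the stored vectors really are unit length.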
| ## Training Details | |
| ### Dataset | |
| - **Source:** [aysinghal/code-retrieval-training-dataset](https://huggingface.co/datasets/aysinghal/code-retrieval-training-dataset) | |
| - **Total pairs:** 5,032,350 | |
| - **Train split:** 4,780,732 pairs (95%) | |
| - **Eval split:** 251,618 pairs (5%) | |
| - **Text strategy:** truncate (max 4096 chars) | |
| - **Negatives:** Explicit hard negatives from the dataset | |
| - **Pre-tokenized:** Yes (token IDs stored on disk for zero-overhead data loading) | |
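As an illustration of the truncate text strategy, each text field can be clipped to the 4,096-character limit before tokenization. The record layout below (`anchor`, `positive`, `negative`) and the helper name `truncate_triplet` are assumptions made for this sketch; the dataset's actual column names are not stated in this card.

```python
MAX_CHARS = 4096  # character limit listed under "Text strategy"

def truncate_triplet(record, max_chars=MAX_CHARS):
    """Clip every text field of a (query, code, hard-negative) record."""
    return {field: text[:max_chars] for field, text in record.items()}

# Hypothetical training record; field names are assumed, not taken from the dataset
example = {
    "anchor": "fix null pointer exception in user authentication",
    "positive": "x = 1\n" * 2000,  # 12,000 characters, longer than the limit
    "negative": "def logout(user):\n    pass\n",
}

clipped = truncate_triplet(example)
print({field: len(text) for field, text in clipped.items()})
```

Short fields pass through unchanged; only texts longer than the limit lose their tail.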
| ### Loss Function | |
| [MultipleNegativesRankingLoss](https://www.sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) | |
| (InfoNCE) with explicit hard negatives. Each training example consists of an | |
| anchor (query), a positive (relevant code), and a hard negative (similar but | |
| irrelevant code). In-batch negatives provide additional contrast. | |
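The objective can be sketched in plain Python as a cross-entropy over scaled cosine similarities. For each anchor, the candidate set contains every positive and every hard negative in the batch, and the anchor's own positive is the correct class. This is a minimal illustration, not the library's implementation; the `scale` value of 20 matches the library's documented default, but the value used for this model is not stated here.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def info_nce_loss(anchors, positives, hard_negatives, scale=20.0):
    """Mean cross-entropy where anchor i must pick positive i among all candidates."""
    # Candidates: all in-batch positives first, then all hard negatives
    candidates = positives + hard_negatives
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, cand) for cand in candidates]
        # Numerically stable log-sum-exp over the candidate logits
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_z - logits[i]
    return total / len(anchors)

# Toy 2-dimensional embeddings for a batch of two triplets
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
hard_negatives = [[0.6, 0.4], [0.4, 0.6]]

aligned_loss = info_nce_loss(anchors, positives, hard_negatives)
print(aligned_loss)
```

The loss is small when each anchor is closest to its own positive and grows when a hard negative or another example's positive scores higher.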
| ### Hyperparameters | |
| | Parameter | Value | | |
| |:---|:---| | |
| | Base model | `openai/gpt2-large` | | |
| | Learning rate | 2e-05 | | |
| | LR schedule | Linear with warmup | | |
| | Warmup ratio | 0.1 | | |
| | Epochs | 3 | | |
| | Effective batch size | 256 | | |
| | Per-GPU batch size | 64 | | |
| | Gradient accumulation | 1 | | |
| | Max sequence length | 512 tokens | | |
| | Precision | BFloat16 | | |
| | Gradient checkpointing | True | | |
| | torch.compile | Enabled (max-autotune) | | |
| | Seed | 42 | | |
| | Eval strategy | Every 915 steps | | |
| | Early stopping patience | 3 | | |
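The learning-rate rows can be read as a linear ramp from zero to the peak rate during warmup, followed by a linear decay back to zero. The sketch below assumes this common form of the schedule; the function name and the `total_steps` value of 1,000 are hypothetical and chosen only to make the example easy to follow.

```python
import math

def linear_warmup_lr(step, total_steps, peak_lr=2e-5, warmup_ratio=0.1):
    """Learning rate at a given optimizer step under linear warmup and decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 up to peak_lr
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: fall linearly from peak_lr down to 0 at the final step
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

# Hypothetical run length, used only for illustration
for step in (0, 50, 100, 550, 1000):
    print(step, linear_warmup_lr(step, total_steps=1000))
```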
| ### Hardware | |
| - **GPUs:** 4x NVIDIA L40S | |
| - **Total training steps:** 0 (3 epochs) | |
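The total of 0 steps listed here looks like an unfilled value. An estimate can be derived from figures elsewhere in this card, assuming that each optimizer step consumes one effective batch, that the final partial batch of each epoch is kept, and that all three epochs run without early stopping. None of these assumptions is confirmed by the card.

```python
import math

# Figures taken from the Dataset and Hyperparameters sections
TRAIN_PAIRS = 4_780_732
PER_GPU_BATCH = 64
NUM_GPUS = 4
GRAD_ACCUM = 1
EPOCHS = 3
EVAL_EVERY = 915

effective_batch = PER_GPU_BATCH * NUM_GPUS * GRAD_ACCUM
# Assumption: the last partial batch is kept, hence ceil rather than floor
steps_per_epoch = math.ceil(TRAIN_PAIRS / effective_batch)
total_steps = steps_per_epoch * EPOCHS
num_evals = total_steps // EVAL_EVERY

print(effective_batch, steps_per_epoch, total_steps, num_evals)
```

Under these assumptions the effective batch size matches the 256 in the table, and the estimate is 18,675 steps per epoch and 56,025 steps in total.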
| ### Training Progress (at checkpoint step 0) | |
| - **Progress:** 0 / 0 steps (0.0%) | |
| ## Usage | |
| ### Loading the Model | |
| ```python | |
| from sentence_transformers import SentenceTransformer | |
| model = SentenceTransformer("aysinghal/ide-code-retrieval-gpt2-large-llm2vec") | |
| ``` | |
| ### Computing Embeddings | |
| ```python | |
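| from sentence_transformers import SentenceTransformer | |
| from sentence_transformers.util import cos_sim | |
| # Load the model (same call as in "Loading the Model") | |
| model = SentenceTransformer("aysinghal/ide-code-retrieval-gpt2-large-llm2vec") | |
| # Natural-language queries, e.g. commit messages | |
| queries = [ | |
|     "fix null pointer exception in user authentication", | |
|     "add retry logic to API client", | |
| ] | |
| # Candidate code documents | |
| code_docs = [ | |
|     "def authenticate(user):\n    if user is None:\n        raise ValueError...", | |
|     "class APIClient:\n    def request(self, url, retries=3):\n        ...", | |
| ] | |
| # Encode both sides into the shared embedding space | |
| query_embeddings = model.encode(queries) | |
| code_embeddings = model.encode(code_docs) | |
| # Cosine similarity matrix of shape (num_queries, num_code_docs) | |
| similarities = cos_sim(query_embeddings, code_embeddings) | |
| print(similarities) | |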
| ``` | |
| ## Intended Use | |
| - **Primary use case:** Retrieving relevant code files/functions given a | |
| natural-language query (commit message, bug description, feature request) | |
| - **Search pipeline:** Encode a corpus of code documents offline, then at query | |
| time encode the query and find nearest neighbors via cosine similarity | |
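The search pipeline can be sketched with a small in-memory index. The vectors below are toy stand-ins for `model.encode(...)` outputs, and the file names and helper functions (`build_index`, `search`) are hypothetical; a real deployment would more likely use a dedicated vector index instead of a linear scan.

```python
import heapq
import math

def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def build_index(doc_vectors):
    """Offline step: store one unit-length vector per document ID."""
    return {doc_id: normalize(vec) for doc_id, vec in doc_vectors.items()}

def search(index, query_vec, top_k=2):
    """Query-time step: rank documents by cosine similarity to the query."""
    q = normalize(query_vec)
    # Dot product of unit vectors equals cosine similarity
    scored = ((sum(a * b for a, b in zip(q, vec)), doc_id) for doc_id, vec in index.items())
    return [(doc_id, score) for score, doc_id in heapq.nlargest(top_k, scored)]

# Toy vectors standing in for code-document embeddings; file names are hypothetical
index = build_index({
    "auth.py": [0.9, 0.1, 0.0],
    "api_client.py": [0.1, 0.9, 0.1],
    "utils.py": [0.3, 0.3, 0.9],
})

# Toy vector standing in for an encoded query
results = search(index, [1.0, 0.2, 0.0])
print(results)
```

The index is built once and reused, so only the query needs to be encoded at search time.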
| ## Limitations | |
| - This is an **early checkpoint** (recorded as 0.0% through training). Training | |
| loss was reported as still decreasing, so later checkpoints will likely | |
| perform better. | |
| - Trained on a specific code retrieval dataset; may not generalize to all | |
| programming languages or query styles without further fine-tuning. | |
| - Max sequence length is 512 tokens; longer inputs are truncated, so the tail | |
| of a very long file does not contribute to its embedding. | |
| ## Citation | |
| If you use this model, please cite the base model: | |
| ```bibtex | |
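| @article{radford2019language, | |
| title={Language Models are Unsupervised Multitask Learners}, | |
| author={Radford, Alec and Wu, Jeffrey and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya}, | |
| journal={OpenAI blog}, | |
| year={2019} | |
| } | |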
| ``` | |