---
language:
- en
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- code-retrieval
- embeddings
base_model: openai/gpt2-large
datasets:
- aysinghal/code-retrieval-training-dataset
pipeline_tag: sentence-similarity
---

# ide-code-retrieval-gpt2-large-llm2vec

A [SentenceTransformer](https://www.sbert.net/) model fine-tuned from [openai/gpt2-large](https://huggingface.co/openai/gpt2-large) for **IDE code retrieval** -- mapping natural-language commit queries to relevant source code documents via dense vector similarity.

> **Note:** This is an intermediate checkpoint at step 0 / 0
> (0.0% through 3 epochs). Training loss is still decreasing,
> so a later checkpoint may perform better.

## Model Description

This model encodes both short natural-language queries (commit messages, search queries) and longer code documents into a shared embedding space. Retrieval is performed by computing cosine similarity between the query embedding and candidate code embeddings.

- **Base model:** [openai/gpt2-large](https://huggingface.co/openai/gpt2-large) (0.6B parameters)
- **Max sequence length:** 512 tokens
- **Output dimensionality:** 1024 (normalized)
- **Similarity function:** Cosine similarity

## Training Details

### Dataset

- **Source:** [aysinghal/code-retrieval-training-dataset](https://huggingface.co/datasets/aysinghal/code-retrieval-training-dataset)
- **Total pairs:** 5,032,350
- **Train split:** 4,780,732 pairs (95%)
- **Eval split:** 251,618 pairs (5%)
- **Text strategy:** truncate (max 4096 chars)
- **Negatives:** Explicit hard negatives from the dataset
- **Pre-tokenized:** Yes (token IDs stored on disk for zero-overhead data loading)

### Loss Function

[MultipleNegativesRankingLoss](https://www.sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) (InfoNCE) with explicit hard negatives.
Each training example consists of an anchor (query), a positive (relevant code), and a hard negative (similar but irrelevant code). In-batch negatives provide additional contrast.

### Hyperparameters

| Parameter | Value |
|:---|:---|
| Base model | `openai/gpt2-large` |
| Learning rate | 2e-05 |
| LR schedule | Linear with warmup |
| Warmup ratio | 0.1 |
| Epochs | 3 |
| Effective batch size | 256 |
| Per-GPU batch size | 64 |
| Gradient accumulation | 1 |
| Max sequence length | 512 tokens |
| Precision | BFloat16 |
| Gradient checkpointing | True |
| torch.compile | Enabled (max-autotune) |
| Seed | 42 |
| Eval strategy | Every 915 steps |
| Early stopping patience | 3 |

### Hardware

- **GPUs:** 4x NVIDIA L40S
- **Total training steps:** 0 (3 epochs)

### Training Progress (at checkpoint step 0)

- **Progress:** 0 / 0 steps (0.0%)

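
The batch-size, dataset-size, and schedule figures listed above are related by simple arithmetic. The sketch below works through that relationship using only numbers stated in this card. It is an estimate under two assumptions that the card does not confirm: the final partial batch of each epoch is kept rather than dropped, and warmup steps are the ceiling of `warmup_ratio * total_steps`. The variable names are illustrative, and the card's own reported step counts (0 / 0) are left as they are.

```python
import math

# Figures copied from the Dataset, Hyperparameters, and Hardware sections.
train_pairs = 4_780_732
per_gpu_batch = 64
num_gpus = 4
grad_accum = 1
epochs = 3
warmup_ratio = 0.1
eval_every = 915

# Effective batch size: examples consumed per optimizer step across all GPUs.
effective_batch = per_gpu_batch * num_gpus * grad_accum  # 256, matching the table

# Assumption: the last, smaller batch of each epoch still counts as a step.
steps_per_epoch = math.ceil(train_pairs / effective_batch)  # 18,675
total_steps = steps_per_epoch * epochs                      # 56,025

# Assumption: warmup length is rounded up to a whole step.
warmup_steps = math.ceil(total_steps * warmup_ratio)        # 5,603

# Number of scheduled evaluations if training runs to completion.
num_evals = total_steps // eval_every                       # 61

print(f"effective batch size: {effective_batch}")
print(f"steps per epoch:      {steps_per_epoch}")
print(f"total steps:          {total_steps}")
print(f"warmup steps:         {warmup_steps}")
print(f"evaluations:          {num_evals}")
```

Early stopping (patience 3) could end training before `total_steps` is reached, so these values are upper bounds rather than a record of the actual run.
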
## Usage

### Loading the Model

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("aysinghal/ide-code-retrieval-gpt2-large-llm2vec")
```

### Computing Embeddings

```python
from sentence_transformers.util import cos_sim

queries = [
    "fix null pointer exception in user authentication",
    "add retry logic to API client",
]
code_docs = [
    "def authenticate(user):\n    if user is None:\n        raise ValueError...",
    "class APIClient:\n    def request(self, url, retries=3):\n        ...",
]

query_embeddings = model.encode(queries)
code_embeddings = model.encode(code_docs)

# Compute cosine similarities
similarities = cos_sim(query_embeddings, code_embeddings)
print(similarities)
```

## Intended Use

- **Primary use case:** Retrieving relevant code files/functions given a natural-language query (commit message, bug description, feature request)
- **Search pipeline:** Encode a corpus of code documents offline, then at query time encode the query and find nearest neighbors via cosine similarity

## Limitations

- This is an **early checkpoint** (0.0% through training). The loss curve is still decreasing, so later checkpoints will likely perform better.
- Trained on a specific code retrieval dataset; may not generalize to all programming languages or query styles without further fine-tuning.
- Max context is 512 tokens -- very long files are truncated.

## Citation

If you use this model, please cite the base model:

```bibtex
@article{qwen3embedding,
  title={Qwen3-Embedding},
  author={Qwen Team},
  year={2025}
}
```
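
## Example: Search Pipeline Sketch

The search pipeline described under Intended Use (encode the corpus offline, then rank by cosine similarity at query time) can be sketched in plain Python. This is a minimal illustration, not the author's pipeline: `normalize`, `build_index`, and `search` are hypothetical helper names, and the toy 3-dimensional vectors stand in for real embeddings so the example runs without downloading the model.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def build_index(doc_embeddings):
    """Offline step: normalize every document embedding once and keep the result."""
    return [normalize(v) for v in doc_embeddings]


def search(query_embedding, index, k=3):
    """Query-time step: return the k best (doc_id, cosine_score) pairs, best first."""
    q = normalize(query_embedding)
    # For unit-length vectors the dot product equals the cosine similarity.
    scores = ((i, sum(a * b for a, b in zip(q, d))) for i, d in enumerate(index))
    return heapq.nlargest(k, scores, key=lambda pair: pair[1])


# Toy vectors standing in for real model outputs (hypothetical data).
toy_docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
index = build_index(toy_docs)

results = search([2.0, 0.1, 0.0], index, k=2)
for doc_id, score in results:
    print(f"doc {doc_id}: cosine similarity {score:.3f}")
```

With the real model, the document and query vectors would come from `model.encode(...)` as in the Usage section, converted to lists or kept as arrays. A linear scan like this is only practical for small corpora; larger ones would typically use a vectorized matrix product or an approximate nearest-neighbor index.
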