aysinghal committed · Commit d2d6d69 (verified) · 1 Parent(s): bb3a275

Final model after 9150 steps
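
A quick sanity check on the step count in the commit message, using the train-split size and effective batch size reported in the README added by this commit. It is only a sketch: it assumes one optimizer step consumes exactly one effective batch of 256 pairs and that the last partial batch is kept, neither of which the source confirms.

```python
import math

# Values copied from the README in this commit.
train_pairs = 4_780_732
effective_batch = 256      # 4 GPUs x 64 per GPU x 1 accumulation step
epochs = 3
reported_steps = 9150      # from the commit message

# Assumption: one optimizer step = one effective batch, last partial batch kept.
steps_per_epoch = math.ceil(train_pairs / effective_batch)
planned_steps = steps_per_epoch * epochs

pairs_seen = reported_steps * effective_batch
epochs_covered = pairs_seen / train_pairs

print(steps_per_epoch)            # 18675
print(planned_steps)              # 56025
print(pairs_seen)                 # 2342400
print(round(epochs_covered, 2))   # 0.49
```

Under these assumptions, 9150 steps cover roughly half of one pass over the train split, so the run appears to have ended well before the 3 configured epochs. Early stopping (patience 3) is one possible explanation, but the card does not say.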

1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
1
+ {
2
+ "word_embedding_dimension": 1280,
3
+ "pooling_mode_cls_token": false,
4
+ "pooling_mode_mean_tokens": true,
5
+ "pooling_mode_max_tokens": false,
6
+ "pooling_mode_mean_sqrt_len_tokens": false,
7
+ "pooling_mode_weightedmean_tokens": false,
8
+ "pooling_mode_lasttoken": false,
9
+ "include_prompt": true
10
+ }
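
The pooling config above enables only `pooling_mode_mean_tokens`. A minimal pure-Python sketch of what masked mean pooling computes follows; the function name `mean_pool` and the toy vectors are illustrative assumptions, not the library's implementation.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose attention-mask entry is 1."""
    dim = len(token_embeddings[0])
    sums = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                sums[i] += value
    # Guard against an all-padding sequence to avoid dividing by zero.
    count = max(count, 1)
    return [s / count for s in sums]


# Toy example: three tokens of dimension 2, the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

For this model the real vectors would have 1280 entries (`word_embedding_dimension`), and the padded position is ignored because its mask entry is 0.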
README.md ADDED
@@ -0,0 +1,154 @@
1
+ ---
2
+ language:
3
+ - en
4
+ license: apache-2.0
5
+ library_name: sentence-transformers
6
+ tags:
7
+ - sentence-transformers
8
+ - sentence-similarity
9
+ - feature-extraction
10
+ - code-retrieval
11
+ - embeddings
12
+ base_model: openai/gpt2-large
13
+ datasets:
14
+ - aysinghal/code-retrieval-training-dataset
15
+ pipeline_tag: sentence-similarity
16
+ ---
17
+
18
+ # ide-code-retrieval-gpt2-large-llm2vec
19
+
20
+ A [SentenceTransformer](https://www.sbert.net/) model fine-tuned from
21
+ [openai/gpt2-large](https://huggingface.co/openai/gpt2-large) for **IDE code retrieval** --
22
+ mapping natural-language commit queries to relevant source code documents via
23
+ dense vector similarity.
24
+
25
+ > **Note:** This is the final model, saved after 9150 training steps
26
+ > (evaluation ran every 915 steps). No per-step loss history is
27
+ > recorded in this card.
28
+
29
+ ## Model Description
30
+
31
+ This model encodes both short natural-language queries (commit messages, search
32
+ queries) and longer code documents into a shared embedding space. Retrieval is
33
+ performed by computing cosine similarity between the query embedding and
34
+ candidate code embeddings.
35
+
36
+ - **Base model:** [openai/gpt2-large](https://huggingface.co/openai/gpt2-large) (774M parameters)
37
+ - **Max sequence length:** 512 tokens
38
+ - **Output dimensionality:** 1280 (mean-pooled; no normalization module is included)
39
+ - **Similarity function:** Cosine similarity
40
+
41
+ ## Training Details
42
+
43
+ ### Dataset
44
+
45
+ - **Source:** [aysinghal/code-retrieval-training-dataset](https://huggingface.co/datasets/aysinghal/code-retrieval-training-dataset)
46
+ - **Total pairs:** 5,032,350
47
+ - **Train split:** 4,780,732 pairs (95%)
48
+ - **Eval split:** 251,618 pairs (5%)
49
+ - **Text strategy:** truncate (max 4096 chars)
50
+ - **Negatives:** Explicit hard negatives from the dataset
51
+ - **Pre-tokenized:** Yes (token IDs stored on disk for zero-overhead data loading)
52
+
53
+ ### Loss Function
54
+
55
+ [MultipleNegativesRankingLoss](https://www.sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss)
56
+ (InfoNCE) with explicit hard negatives. Each training example consists of an
57
+ anchor (query), a positive (relevant code), and a hard negative (similar but
58
+ irrelevant code). In-batch negatives provide additional contrast.
59
+
60
+ ### Hyperparameters
61
+
62
+ | Parameter | Value |
63
+ |:---|:---|
64
+ | Base model | `openai/gpt2-large` |
65
+ | Learning rate | 2e-05 |
66
+ | LR schedule | Linear with warmup |
67
+ | Warmup ratio | 0.1 |
68
+ | Epochs | 3 |
69
+ | Effective batch size | 256 |
70
+ | Per-GPU batch size | 64 |
71
+ | Gradient accumulation | 1 |
72
+ | Max sequence length | 512 tokens |
73
+ | Precision | BFloat16 |
74
+ | Gradient checkpointing | True |
75
+ | torch.compile | Enabled (max-autotune) |
76
+ | Seed | 42 |
77
+ | Eval strategy | Every 915 steps |
78
+ | Early stopping patience | 3 |
79
+
80
+ ### Hardware
81
+
82
+ - **GPUs:** 4x NVIDIA L40S
83
+ - **Total training steps:** 9150 (per the commit message; 3 epochs configured)
84
+
85
+ ### Training Progress (final model, step 9150)
86
+
87
+ - **Progress:** 9150 steps completed (final model)
88
+
89
+ <details>
90
+ <summary>Full training loss history (click to expand)</summary>
91
+
92
+
93
+
94
+ </details>
95
+
96
+ ## Usage
97
+
98
+ ### Loading the Model
99
+
100
+ ```python
101
+ from sentence_transformers import SentenceTransformer
102
+
103
+ model = SentenceTransformer("aysinghal/ide-code-retrieval-gpt2-large-llm2vec")
104
+ ```
105
+
106
+ ### Computing Embeddings
107
+
108
+ ```python
109
+ queries = [
110
+ "fix null pointer exception in user authentication",
111
+ "add retry logic to API client",
112
+ ]
113
+ code_docs = [
114
+ "def authenticate(user):\n if user is None:\n raise ValueError...",
115
+ "class APIClient:\n def request(self, url, retries=3):\n ...",
116
+ ]
117
+
118
+ query_embeddings = model.encode(queries)
119
+ code_embeddings = model.encode(code_docs)
120
+
121
+ # Compute cosine similarities
122
+ from sentence_transformers.util import cos_sim
123
+ similarities = cos_sim(query_embeddings, code_embeddings)
124
+ print(similarities)
125
+ ```
126
+
127
+ ## Intended Use
128
+
129
+ - **Primary use case:** Retrieving relevant code files/functions given a
130
+ natural-language query (commit message, bug description, feature request)
131
+ - **Search pipeline:** Encode a corpus of code documents offline, then at query
132
+ time encode the query and find nearest neighbors via cosine similarity
133
+
134
+ ## Limitations
135
+
136
+ - No evaluation metrics or loss history are reported in this card
137
+ (final model after 9150 steps), so validate retrieval quality on your
138
+ own data before relying on it.
139
+ - Trained on a specific code retrieval dataset; may not generalize to all
140
+ programming languages or query styles without further fine-tuning.
141
+ - Max context is 512 tokens -- very long
142
+ files are truncated.
143
+
144
+ ## Citation
145
+
146
+ If you use this model, please cite the base model:
147
+
148
+ ```bibtex
149
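+ @article{radford2019language,
150
+ title={Language Models are Unsupervised Multitask Learners},
151
+ author={Radford, Alec and Wu, Jeff and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya},
152
+ year={2019}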
153
+ }
154
+ ```
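
The README's Loss Function section describes InfoNCE with one hard negative per example plus in-batch negatives. The sketch below illustrates that objective in pure Python. It is not the sentence-transformers implementation; the helper names are hypothetical, and `scale=20.0` is that library's default for `MultipleNegativesRankingLoss`, not a value confirmed for this run.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, hard_negatives, scale=20.0):
    """Mean cross-entropy of each anchor against all positives and hard negatives.

    For anchor i the correct candidate is positives[i]; every other positive
    and every hard negative in the batch acts as a negative.
    """
    candidates = positives + hard_negatives
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, c) for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        top = max(logits)
        log_z = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_z - logits[i]
    return total / len(anchors)


# Toy batch of two (query, relevant code, hard negative) triplets.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.1], [0.1, 1.0]]
hard_negatives = [[0.7, 0.7], [0.6, 0.8]]

aligned = info_nce(anchors, positives, hard_negatives)
# Swapping the positives makes each anchor's "correct" candidate a poor match.
misaligned = info_nce(anchors, positives[::-1], hard_negatives)
print(aligned < misaligned)  # True
```

The loss is lower when each query is closest to its own positive, which is the behavior the training objective rewards.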
config.json ADDED
@@ -0,0 +1,40 @@
1
+ {
2
+ "_name_or_path": "./output/run_20260520_131023_truncate_hard/final_model",
3
+ "activation_function": "gelu_new",
4
+ "architectures": [
5
+ "GPT2Model"
6
+ ],
7
+ "attn_pdrop": 0.1,
8
+ "bos_token_id": 50256,
9
+ "embd_pdrop": 0.1,
10
+ "eos_token_id": 50256,
11
+ "initializer_range": 0.02,
12
+ "layer_norm_epsilon": 1e-05,
13
+ "model_type": "gpt2",
14
+ "n_ctx": 1024,
15
+ "n_embd": 1280,
16
+ "n_head": 20,
17
+ "n_inner": null,
18
+ "n_layer": 36,
19
+ "n_positions": 1024,
20
+ "pad_token_id": 50256,
21
+ "reorder_and_upcast_attn": false,
22
+ "resid_pdrop": 0.1,
23
+ "scale_attn_by_inverse_layer_idx": false,
24
+ "scale_attn_weights": true,
25
+ "summary_activation": null,
26
+ "summary_first_dropout": 0.1,
27
+ "summary_proj_to_labels": true,
28
+ "summary_type": "cls_index",
29
+ "summary_use_proj": true,
30
+ "task_specific_params": {
31
+ "text-generation": {
32
+ "do_sample": true,
33
+ "max_length": 50
34
+ }
35
+ },
36
+ "torch_dtype": "float32",
37
+ "transformers_version": "4.44.2",
38
+ "use_cache": true,
39
+ "vocab_size": 50257
40
+ }
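
The parameter count of the base model can be cross-checked from the values in this `config.json`. The sketch below assumes the standard GPT-2 layout for a `GPT2Model` (token and position embeddings, per-block attention and MLP weights with biases and two LayerNorms, one final LayerNorm, no separate LM head); non-trainable attention-mask buffers are not counted.

```python
# Values copied from config.json in this commit.
config = {
    "vocab_size": 50257,
    "n_positions": 1024,
    "n_embd": 1280,
    "n_layer": 36,
    "n_head": 20,
    "n_inner": None,
}

d = config["n_embd"]
# GPT-2 uses an MLP width of 4 * n_embd when n_inner is null.
inner = config["n_inner"] or 4 * d

embeddings = config["vocab_size"] * d + config["n_positions"] * d
per_layer = (
    2 * d                    # first LayerNorm (weight + bias)
    + d * 3 * d + 3 * d      # fused query/key/value projection
    + d * d + d              # attention output projection
    + 2 * d                  # second LayerNorm
    + d * inner + inner      # MLP up-projection
    + inner * d + d          # MLP down-projection
)
final_layer_norm = 2 * d

total_params = embeddings + config["n_layer"] * per_layer + final_layer_norm
head_dim = d // config["n_head"]

print(total_params)  # 774030080
print(head_dim)      # 64
```

This gives about 774M parameters, and a per-head dimension of 64.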
config_sentence_transformers.json ADDED
@@ -0,0 +1,14 @@
1
+ {
2
+ "model_type": "SentenceTransformer",
3
+ "__version__": {
4
+ "sentence_transformers": "5.2.3",
5
+ "transformers": "4.44.2",
6
+ "pytorch": "2.10.0+cu128"
7
+ },
8
+ "prompts": {
9
+ "query": "",
10
+ "document": ""
11
+ },
12
+ "default_prompt_name": null,
13
+ "similarity_fn_name": "cosine"
14
+ }
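
Since `similarity_fn_name` is `cosine`, retrieval amounts to ranking documents by cosine similarity to the query. A minimal sketch with toy 2-dimensional vectors standing in for real 1280-dimensional embeddings; the `rank` helper is hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank(query, docs, top_k=2):
    """Return the indices of the top_k documents, most similar first."""
    scored = [(cosine(query, doc), idx) for idx, doc in enumerate(docs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:top_k]]


query = [1.0, 0.0]
docs = [[0.0, 2.0], [3.0, 0.5], [1.0, 1.0]]
top = rank(query, docs)
print(top)  # [1, 2]
```

Cosine similarity ignores vector length, so document 1 ranks first because of its direction, not its larger norm. That matters here because the saved pipeline does not normalize its outputs.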
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a7a930788e60678f749bfb7649c65ca947c61d1d900ee28e63f39058d75423c5
3
+ size 3096160696
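
The LFS pointer's size gives a rough upper bound on the number of stored weights. The sketch assumes every tensor is stored as 4-byte float32 (matching `torch_dtype` in `config.json`) and ignores the safetensors header, so the true count is slightly lower.

```python
size_bytes = 3_096_160_696   # from the LFS pointer above
bytes_per_value = 4          # assumption: all tensors are float32

# Upper bound: the file also contains a small JSON header.
max_values = size_bytes // bytes_per_value
print(max_values)                 # 774040174
print(round(max_values / 1e6))    # 774
```

About 774 million values is consistent with GPT-2 large rather than a 0.6B-parameter model.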
modules.json ADDED
@@ -0,0 +1,14 @@
1
+ [
2
+ {
3
+ "idx": 0,
4
+ "name": "0",
5
+ "path": "",
6
+ "type": "sentence_transformers.models.Transformer"
7
+ },
8
+ {
9
+ "idx": 1,
10
+ "name": "1",
11
+ "path": "1_Pooling",
12
+ "type": "sentence_transformers.models.Pooling"
13
+ }
14
+ ]
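
`modules.json` defines the inference pipeline in order. The sketch below parses a copy of the file's content and checks whether a `Normalize` module is present; the inline JSON string is a transcription of the diff above, and in practice you would read the file from the model directory.

```python
import json

# Transcribed from modules.json in this commit.
modules_json = """
[
  {"idx": 0, "name": "0", "path": "",
   "type": "sentence_transformers.models.Transformer"},
  {"idx": 1, "name": "1", "path": "1_Pooling",
   "type": "sentence_transformers.models.Pooling"}
]
"""

modules = json.loads(modules_json)
# Keep only the class name of each stage, in pipeline order.
pipeline = [
    module["type"].rsplit(".", 1)[-1]
    for module in sorted(modules, key=lambda module: module["idx"])
]
has_normalize = "Normalize" in pipeline

print(pipeline)       # ['Transformer', 'Pooling']
print(has_normalize)  # False
```

With no `Normalize` stage, the model returns raw mean-pooled vectors. Cosine similarity is unaffected, but dot-product search would require normalizing the embeddings first.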
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
1
+ {
2
+ "max_seq_length": 512,
3
+ "do_lower_case": false
4
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,30 @@
1
+ {
2
+ "bos_token": {
3
+ "content": "<|endoftext|>",
4
+ "lstrip": false,
5
+ "normalized": true,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "eos_token": {
10
+ "content": "<|endoftext|>",
11
+ "lstrip": false,
12
+ "normalized": true,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "pad_token": {
17
+ "content": "<|endoftext|>",
18
+ "lstrip": false,
19
+ "normalized": true,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ },
23
+ "unk_token": {
24
+ "content": "<|endoftext|>",
25
+ "lstrip": false,
26
+ "normalized": true,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ }
30
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,27 @@
1
+ {
2
+ "add_prefix_space": false,
3
+ "added_tokens_decoder": {
4
+ "50256": {
5
+ "content": "<|endoftext|>",
6
+ "lstrip": false,
7
+ "normalized": true,
8
+ "rstrip": false,
9
+ "single_word": false,
10
+ "special": true
11
+ }
12
+ },
13
+ "bos_token": "<|endoftext|>",
14
+ "clean_up_tokenization_spaces": true,
15
+ "eos_token": "<|endoftext|>",
16
+ "max_length": 512,
17
+ "model_max_length": 512,
18
+ "pad_to_multiple_of": null,
19
+ "pad_token": "<|endoftext|>",
20
+ "pad_token_type_id": 0,
21
+ "padding_side": "right",
22
+ "stride": 0,
23
+ "tokenizer_class": "GPT2Tokenizer",
24
+ "truncation_side": "right",
25
+ "truncation_strategy": "longest_first",
26
+ "unk_token": "<|endoftext|>"
27
+ }
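
The tokenizer config truncates and pads on the right, and the pad token shares id 50256 with the end-of-text token. The sketch below shows what that implies for a batch; `pad_batch` is a hypothetical helper, and the example uses a tiny `max_len` in place of 512 to keep the output readable.

```python
PAD_ID = 50256  # "<|endoftext|>" doubles as the pad token


def pad_batch(batch, max_len=512, pad_id=PAD_ID):
    """Right-truncate each sequence to max_len, then right-pad to the longest."""
    truncated = [ids[:max_len] for ids in batch]
    width = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        padding = width - len(ids)
        input_ids.append(ids + [pad_id] * padding)
        # The mask is the only way to tell padding from a real end-of-text token.
        attention_mask.append([1] * len(ids) + [0] * padding)
    return input_ids, attention_mask


ids, mask = pad_batch([[10, 11, 12, 13, 14], [20, 21]], max_len=4)
print(ids)   # [[10, 11, 12, 13], [20, 21, 50256, 50256]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

Because padding and end-of-text share an id, mean pooling has to rely on the attention mask rather than on token ids to skip padded positions.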
vocab.json ADDED
The diff for this file is too large to render. See raw diff