---
language:
  - code
license: mit
base_model: microsoft/graphcodebert-base
tags:
  - code-search
  - semantic-search
  - graphcodebert
  - erlang
  - cpp
library_name: transformers
pipeline_tag: feature-extraction
---

# GraphCode-CErl — Semantic Code Search for Erlang & C++

Fine-tuned [GraphCodeBERT](https://huggingface.co/microsoft/graphcodebert-base) for semantic code search over **Erlang** and **C++** codebases. Given a natural language query, the model retrieves the most semantically relevant functions from an indexed repository.

## Model Description

This is a bi-encoder trained with contrastive learning. It encodes both natural language queries and code snippets into a shared embedding space, enabling efficient cosine-similarity-based retrieval at search time.

- **Base model:** `microsoft/graphcodebert-base`
- **Architecture:** GraphCodeBERT encoder with mean pooling + L2 normalization (no LM head)
- **Languages trained on:** Erlang, C++
- **Task:** Semantic code search / function retrieval

### Architecture detail

The model wraps the GraphCodeBERT encoder in a lightweight `CodeSearchModel`:

```python
import torch

# Mean pooling over non-padding token positions (not just the CLS token)
def mean_pooling(last_hidden_state, attention_mask):
    # Broadcast the (batch, seq) mask to (batch, seq, hidden) so padding contributes zero
    mask = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()
    return torch.sum(last_hidden_state * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)
```

Embeddings are L2-normalized, so retrieval is a plain dot product (equivalent to cosine similarity).
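
As a minimal sketch of what this means at search time, the snippet below ranks a corpus against a query with a single matrix product. The `top_k` helper is hypothetical (it is not part of the repository), and random vectors stand in for real embeddings; 768 is the hidden size of `graphcodebert-base`.

```python
import torch
import torch.nn.functional as F

def top_k(query_emb, corpus_emb, k=5):
    """Return (scores, indices) of the k best corpus rows for each query row.

    Both inputs are assumed to be L2-normalized, so the matrix product
    below yields cosine similarities directly.
    """
    scores = query_emb @ corpus_emb.T          # (num_queries, corpus_size)
    k = min(k, corpus_emb.size(0))             # never ask for more rows than exist
    return torch.topk(scores, k=k, dim=1)      # sorted from best to worst

# Random stand-ins for real embeddings
torch.manual_seed(0)
corpus = F.normalize(torch.randn(100, 768), dim=1)
query = F.normalize(torch.randn(1, 768), dim=1)

scores, indices = top_k(query, corpus, k=5)
print(indices[0].tolist(), scores[0].tolist())
```

Because the normalization is done once at index time, the search step needs no per-query division by vector norms.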

---

## Training

### Data

Training triplets were constructed from two sources:

| Language | Source | Records |
|----------|--------|---------|
| C++ | [`codeparrot/xlcost-text-to-code`](https://huggingface.co/datasets/codeparrot/xlcost-text-to-code) (C++-program-level) | 8,650 |
| Erlang | Private dataset (not released) | — |

Each record is a `(code, good_docstring, bad1_docstring, bad2_docstring)` tuple. Negatives were mined as follows:
- **60% hard negatives** — BM25-retrieved docstrings that are lexically similar to the positive but semantically wrong (top-20 BM25 candidates, sampled randomly)
- **30% cross-language negatives** — docstrings sampled from the opposite language to discourage language-specific shortcuts
- **10% random negatives** — uniform random docstrings as easy negatives
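
The 60/30/10 mix above can be sketched as a single sampling function. This is an illustration under stated assumptions, not the repository's mining code: `sample_negative` and its parameters are hypothetical names, and the BM25 top-20 candidates are assumed to have been retrieved beforehand and passed in as a list.

```python
import random

def sample_negative(positive, bm25_top20, other_language_pool, all_docstrings, rng=random):
    """Pick one negative docstring following the 60/30/10 mix.

    positive            -- the correct docstring (never returned)
    bm25_top20          -- precomputed BM25 candidates for the positive (assumed input)
    other_language_pool -- docstrings from the opposite language
    all_docstrings      -- every docstring, used for random negatives and as fallback
    """
    r = rng.random()
    if r < 0.6:
        pool = bm25_top20            # hard negative
    elif r < 0.9:
        pool = other_language_pool   # cross-language negative
    else:
        pool = all_docstrings        # easy random negative
    candidates = [d for d in pool if d != positive]
    if not candidates:
        # Fall back to the full pool when the chosen one has no usable entry
        candidates = [d for d in all_docstrings if d != positive]
    return rng.choice(candidates)

rng = random.Random(42)
negative = sample_negative(
    positive="Compute the factorial of n",
    bm25_top20=["Compute the n-th Fibonacci number", "Compute n choose k"],
    other_language_pool=["Spawn a supervised worker process"],
    all_docstrings=["Compute the factorial of n", "Reverse a linked list"],
    rng=rng,
)
print(negative)
```

Calling the function twice per record would yield the `bad1_docstring` and `bad2_docstring` fields; whether the original pipeline draws the two negatives independently is not stated here.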

### Loss

Temperature-scaled cross-entropy over augmented scores. For each batch the score matrix is extended with both negatives:

```
augmented_scores = [good_scores | bad1_scores | bad2_scores]
loss = CrossEntropyLoss(augmented_scores / τ, diagonal_labels)
```

where `τ = 0.05`.
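
One way to write this loss in PyTorch is sketched below. The function name `contrastive_loss` is hypothetical, and the sketch assumes that each score block is a full batch-by-batch matrix (every docstring in the batch is a candidate for every code snippet), which the pseudocode above suggests but does not state outright.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(code_emb, good_emb, bad1_emb, bad2_emb, tau=0.05):
    """Cross-entropy over [good | bad1 | bad2] similarity scores.

    All inputs have shape (batch, dim) and are assumed L2-normalized.
    Assumption: each block is a (batch, batch) matrix of dot products.
    """
    good_scores = code_emb @ good_emb.T   # positives sit on the diagonal
    bad1_scores = code_emb @ bad1_emb.T
    bad2_scores = code_emb @ bad2_emb.T
    augmented = torch.cat([good_scores, bad1_scores, bad2_scores], dim=1)  # (B, 3B)
    # Row i must pick column i: its own positive docstring
    labels = torch.arange(code_emb.size(0), device=code_emb.device)
    return F.cross_entropy(augmented / tau, labels)

# Random stand-ins for real embeddings
torch.manual_seed(0)
B, D = 4, 768
code = F.normalize(torch.randn(B, D), dim=1)
bad1 = F.normalize(torch.randn(B, D), dim=1)
bad2 = F.normalize(torch.randn(B, D), dim=1)

loss_aligned = contrastive_loss(code, code, bad1, bad2)  # positives identical to the code
loss_random = contrastive_loss(code, F.normalize(torch.randn(B, D), dim=1), bad1, bad2)
print(loss_aligned.item(), loss_random.item())
```

With a small temperature such as 0.05, score differences are amplified twentyfold before the softmax, so a positive that is only slightly closer than a hard negative still produces a confident prediction.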

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Base model | `microsoft/graphcodebert-base` |
| Batch size | 32 |
| Epochs | 10 |
| Learning rate | 2e-5 |
| LR schedule | Linear warmup (10%) → linear decay to 0 |
| Optimizer | AdamW |
| Gradient clipping | 1.0 |
| Code max length | 256 tokens |
| NL max length | 128 tokens |
| Temperature (τ) | 0.05 |
| Early stopping patience | 3 (not triggered) |
| Seed | 42 |
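
The learning-rate schedule in the table can be sketched as a plain function of the step index. `lr_at_step` is a hypothetical helper and the 1,000-step total is an arbitrary example value; in practice `get_linear_schedule_with_warmup` from `transformers` produces the same shape, though the table does not say which implementation was used.

```python
def lr_at_step(step, total_steps, base_lr=2e-5, warmup_frac=0.10):
    """Linear warmup over the first 10% of steps, then linear decay to 0."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp up from 0 to base_lr
        return base_lr * step / warmup_steps
    # Ramp down from base_lr at the end of warmup to 0 at the final step
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

total_steps = 1000  # example value only
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step, total_steps))
```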

### Training curve

| Epoch | Loss |
|-------|------|
| 1 | 1.4135 |
| 2 | 0.4685 |
| 3 | 0.3438 |
| 4 | 0.2738 |
| 5 | 0.2308 |
| 6 | 0.1997 |
| 7 | 0.1671 |
| 8 | 0.1507 |
| 9 | 0.1425 |
| **10** | **0.1348** ← best |

The loss decreased at every epoch, so early stopping (patience = 3) never triggered and training ran for all 10 epochs. The best checkpoint is the one saved at epoch 10.
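
To illustrate how the patience rule interacts with the curve above, the sketch below replays the listed losses through a simple early-stopping check. `run_early_stopping` is a hypothetical helper; it assumes that the monitored quantity is the loss in the table and that any strictly lower value counts as an improvement.

```python
def run_early_stopping(losses, patience=3):
    """Return (best_epoch, stopped_epoch); stopped_epoch is None if never triggered.

    Epochs are 1-indexed. Training stops once `patience` consecutive
    epochs fail to improve on the best loss seen so far.
    """
    best_loss = float("inf")
    best_epoch = None
    bad_epochs = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch
    return best_epoch, None

# Losses copied from the training curve table
losses = [1.4135, 0.4685, 0.3438, 0.2738, 0.2308,
          0.1997, 0.1671, 0.1507, 0.1425, 0.1348]
best_epoch, stopped_epoch = run_early_stopping(losses, patience=3)
print(best_epoch, stopped_epoch)
```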

---

## Usage

This model is intended to be used with [`code_search.py`](https://github.com/MatthewsO3/GraphCode-CErl-base/tree/main/Code%20Search), a unified indexing and search tool included in the repository.

### Quick start

```bash
git clone https://github.com/MatthewsO3/GraphCode-CErl-base
cd "GraphCode-CErl-base/Code Search/Evaluation"
python setup.py          # creates .venv, installs deps, builds erlang.so
source .venv/bin/activate

# Index a repository (auto-discovers Erlang + C++ + Python)
python code_search.py index \
    --repo /path/to/your/repo \
    --model MatthewsO3/GraphCode-CErl-codesearch \
    --output corpus.jsonl \
    --index corpus_index.pt

# Search interactively
python code_search.py search \
    --model MatthewsO3/GraphCode-CErl-codesearch \
    --jsonl corpus.jsonl \
    --index corpus_index.pt \
    --top 5
```

Language-specific flags are also available and can be combined freely:

```bash
# Erlang only
python code_search.py index --erlang /path/to/erl_repo ...

# C++ only
python code_search.py index --cpp /path/to/cpp_repo ...

# Explicit mix
python code_search.py index --erlang /path/erl --cpp /path/cpp --python /path/py ...
```

### Using the model directly

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("microsoft/graphcodebert-base")
model = AutoModel.from_pretrained("MatthewsO3/GraphCode-CErl-codesearch")
model.eval()

def encode(texts):
    enc = tokenizer(texts, return_tensors="pt", truncation=True,
                    padding=True, max_length=256)
    with torch.no_grad():
        out = model(**enc)
    # Mean pooling
    mask = enc["attention_mask"].unsqueeze(-1).float()
    emb = (out.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
    return emb / emb.norm(dim=1, keepdim=True)

query = encode(["handle TCP connection timeout"])
code  = encode(["handle_timeout(Socket, State) -> gen_tcp:close(Socket), {stop, timeout, State}."])

score = (query @ code.T).item()
print(f"Similarity: {score:.4f}")
```

> **Note:** The tokenizer is loaded from `microsoft/graphcodebert-base` because it is identical to the fine-tuned model's tokenizer; loading it from the base model avoids a redundant download.

---

## Supported Languages

| Language | Extractor | Extensions |
|----------|-----------|------------|
| Erlang | tree-sitter (WhatsApp grammar) + custom `ErlangParser` + regex fallback | `.erl`, `.hrl` |
| C++ | tree-sitter + regex fallback | `.cpp`, `.cc`, `.cxx`, `.c`, `.h`, `.hpp` |
| Python | tree-sitter + regex fallback | `.py` |

> **Note:** Python indexing is supported by `code_search.py`, but the model was not trained on Python data. Retrieval over Python code may be less accurate than for Erlang or C++.

---

## Limitations

- Not trained on Python — cross-language transfer to Python is best-effort
- The Erlang training set is private and not released
- Functions without docstrings or comments are embedded on code tokens alone, which may reduce retrieval accuracy for ambiguous natural language queries
- Running on CPU is fully supported but slow for large corpora at index-build time; a GPU is recommended

---

## Repository

Training code, indexing tool, and setup scripts are available at:
[github.com/MatthewsO3/GraphCode-CErl-base](https://github.com/MatthewsO3/GraphCode-CErl-base)

---

## Citation

If you use this model, please cite the original GraphCodeBERT paper:

```bibtex
@inproceedings{guo2021graphcodebert,
  title     = {GraphCodeBERT: Pre-training Code Representations with Data Flow},
  author    = {Guo, Daya and Ren, Shuo and Lu, Shuai and Feng, Zhangyin and Tang, Duyu
               and Liu, Shujie and Zhou, Long and Duan, Nan and Svyatkovskiy, Alexey
               and Fu, Shengyu and others},
  booktitle = {International Conference on Learning Representations},
  year      = {2021}
}
```