---
language:
- code
license: mit
base_model: microsoft/graphcodebert-base
tags:
- code-search
- semantic-search
- graphcodebert
- erlang
- cpp
library_name: transformers
pipeline_tag: feature-extraction
---
# GraphCode-CErl — Semantic Code Search for Erlang & C++
Fine-tuned [GraphCodeBERT](https://huggingface.co/microsoft/graphcodebert-base) for semantic code search over **Erlang** and **C++** codebases. Given a natural language query, the model retrieves the most semantically relevant functions from an indexed repository.
## Model Description
This is a bi-encoder trained with contrastive learning. It encodes both natural language queries and code snippets into a shared embedding space, enabling efficient cosine-similarity-based retrieval at search time.
- **Base model:** `microsoft/graphcodebert-base`
- **Architecture:** GraphCodeBERT encoder with mean pooling + L2 normalization (no LM head)
- **Languages trained on:** Erlang, C++
- **Task:** Semantic code search / function retrieval
### Architecture detail
The model wraps the GraphCodeBERT encoder in a lightweight `CodeSearchModel`:
```python
import torch

# Mean pooling over all non-padding token positions (not CLS)
def mean_pooling(last_hidden_state, attention_mask):
    mask = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()
    return torch.sum(last_hidden_state * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)
```
Embeddings are L2-normalized, so retrieval is a plain dot product (equivalent to cosine similarity).
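The equivalence between dot product and cosine similarity can be checked with a small framework-free sketch. The helper names (`l2_normalize`, `dot`, `cosine`, `top_k`) and the toy 2-D vectors are illustrative assumptions, not part of the repository; real embeddings come from the encoder.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed on the raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def top_k(query, corpus, k):
    """Rank unit-length corpus vectors by dot product with a unit-length query."""
    scores = [(dot(query, vec), idx) for idx, vec in enumerate(corpus)]
    return [idx for _, idx in sorted(scores, reverse=True)[:k]]

# Toy stand-ins for query and code embeddings (hypothetical values).
raw_query = [3.0, 4.0]
raw_corpus = [[4.0, 3.0], [-3.0, -4.0], [6.0, 8.0]]

query = l2_normalize(raw_query)
corpus = [l2_normalize(v) for v in raw_corpus]
ranking = top_k(query, corpus, k=2)
```

Because every vector has unit length after normalization, the denominator of the cosine formula is 1, so ranking by dot product and ranking by cosine similarity give the same order.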
---
## Training
### Data
Training records were constructed from two sources:
| Language | Source | Records |
|----------|--------|---------|
| C++ | [`codeparrot/xlcost-text-to-code`](https://huggingface.co/datasets/codeparrot/xlcost-text-to-code) (C++-program-level) | 8,650 |
| Erlang | Private dataset (not released) | — |
Each record is a `(code, good_docstring, bad1_docstring, bad2_docstring)` tuple. Negatives were mined as follows:
- **60% hard negatives** — BM25-retrieved docstrings that are lexically similar to the positive but semantically wrong (top-20 BM25 candidates, sampled randomly)
- **30% cross-language negatives** — docstrings sampled from the opposite language to discourage language-specific shortcuts
- **10% random negatives** — uniform random docstrings as easy negatives
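The mix above could be sampled roughly as follows. This is a sketch under stated assumptions, not the repository's mining code: it assumes the 60/30/10 split is applied independently per sampled negative, and that the BM25 top-20 list, the opposite-language pool, and the full docstring pool are already available as plain lists. All function and parameter names are hypothetical.

```python
import random

# Sampling weights taken from the list above.
NEGATIVE_MIX = (("hard", 0.6), ("cross_language", 0.3), ("random", 0.1))

def pick_negative_kind(rng):
    """Draw a negative type according to the 60/30/10 mix."""
    kinds = [kind for kind, _ in NEGATIVE_MIX]
    weights = [weight for _, weight in NEGATIVE_MIX]
    return rng.choices(kinds, weights=weights, k=1)[0]

def sample_negative(positive, bm25_top20, other_language_pool, all_docstrings, rng):
    """Return (kind, docstring) for one negative, never equal to the positive.

    bm25_top20 is assumed to be precomputed by a BM25 ranker (hypothetical input).
    """
    kind = pick_negative_kind(rng)
    if kind == "hard":
        pool = bm25_top20            # lexically similar, sampled uniformly
    elif kind == "cross_language":
        pool = other_language_pool   # docstrings from the opposite language
    else:
        pool = all_docstrings        # easy, uniformly random negatives
    candidates = [doc for doc in pool if doc != positive]
    if not candidates:               # fall back if the chosen pool is unusable
        candidates = [doc for doc in all_docstrings if doc != positive]
    return kind, rng.choice(candidates)

rng = random.Random(42)
kind, negative = sample_negative(
    positive="close the socket on timeout",
    bm25_top20=["close the file on timeout", "open the socket"],
    other_language_pool=["sort a vector in place"],
    all_docstrings=["close the socket on timeout", "compute factorial", "parse JSON"],
    rng=rng,
)
```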
### Loss
Temperature-scaled cross-entropy over augmented scores. For each batch the score matrix is extended with both negatives:
```
augmented_scores = [good_scores | bad1_scores | bad2_scores]
loss = CrossEntropyLoss(augmented_scores / τ, diagonal_labels)
```
where `τ = 0.05`.
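A framework-free sketch of this loss may help make the shapes concrete. It assumes each of `good_scores`, `bad1_scores`, and `bad2_scores` is a `B × B` similarity matrix (code `i` against docstring `j`), that they are concatenated column-wise into a `B × 3B` matrix, and that the label for row `i` is column `i`. The card does not state these shapes, so treat them as assumptions; the function names are hypothetical.

```python
import math

def cross_entropy_row(logits, target):
    """Cross-entropy for one row: log-sum-exp of the logits minus the target logit."""
    peak = max(logits)  # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - logits[target]

def augmented_loss(good, bad1, bad2, tau=0.05):
    """Mean temperature-scaled cross-entropy over [good | bad1 | bad2]."""
    batch = len(good)
    total = 0.0
    for i in range(batch):
        # Row i: code i scored against every good, bad1, and bad2 docstring.
        row = [score / tau for score in good[i] + bad1[i] + bad2[i]]
        total += cross_entropy_row(row, target=i)  # diagonal label
    return total / batch

# Toy cosine similarities for a batch of two (hypothetical values).
good = [[0.9, 0.1], [0.2, 0.8]]
bad1 = [[0.3, 0.0], [0.1, 0.4]]
bad2 = [[0.2, 0.1], [0.0, 0.3]]
loss = augmented_loss(good, bad1, bad2)
```

With `τ = 0.05` the scores are multiplied by 20 before the softmax, so even modest similarity gaps between the positive and the negatives drive the loss close to zero.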
### Hyperparameters
| Parameter | Value |
|-----------|-------|
| Base model | `microsoft/graphcodebert-base` |
| Batch size | 32 |
| Epochs | 10 |
| Learning rate | 2e-5 |
| LR schedule | Linear warmup (10%) → linear decay to 0 |
| Optimizer | AdamW |
| Gradient clipping | 1.0 |
| Code max length | 256 tokens |
| NL max length | 128 tokens |
| Temperature (τ) | 0.05 |
| Early stopping patience | 3 (not triggered) |
| Seed | 42 |
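The learning-rate schedule in the table can be sketched as a pure function of the step index. This is an illustration under assumptions: warmup is taken as the first 10% of optimizer steps, and the function name and signature are hypothetical. The behavior matches what `transformers.get_linear_schedule_with_warmup` computes, though the card does not confirm which scheduler implementation was used.

```python
def lr_at_step(step, total_steps, base_lr=2e-5, warmup_fraction=0.1):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    warmup_steps = int(total_steps * warmup_fraction)
    if step < warmup_steps:
        # Warmup phase: scale linearly with the step index.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: scale linearly with the number of remaining steps.
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))

# Sample the schedule at a few points of a hypothetical 1,000-step run.
schedule = [lr_at_step(s, total_steps=1000) for s in (0, 50, 100, 550, 1000)]
```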
### Training curve
| Epoch | Loss |
|-------|------|
| 1 | 1.4135 |
| 2 | 0.4685 |
| 3 | 0.3438 |
| 4 | 0.2738 |
| 5 | 0.2308 |
| 6 | 0.1997 |
| 7 | 0.1671 |
| 8 | 0.1507 |
| 9 | 0.1425 |
| **10** | **0.1348** ← best |
Training ran for all 10 epochs without triggering early stopping (patience = 3); the loss improved at every epoch, so the best checkpoint is the one saved at epoch 10.
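To illustrate why the patience counter never fired, the following sketch replays the losses from the table through a simple early-stopping loop. It assumes the monitored quantity is the per-epoch loss shown above (the card does not say which metric was monitored), and the function name is hypothetical.

```python
# Per-epoch losses copied from the training-curve table.
EPOCH_LOSSES = [1.4135, 0.4685, 0.3438, 0.2738, 0.2308,
                0.1997, 0.1671, 0.1507, 0.1425, 0.1348]

def run_early_stopping(losses, patience=3):
    """Return (best_epoch, last_epoch_run, stopped_early) for a loss history."""
    best_loss = float("inf")
    best_epoch = None
    bad_epochs = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            # Improvement: record the new best and reset the counter.
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch, True
    return best_epoch, len(losses), False

best_epoch, last_epoch, stopped = run_early_stopping(EPOCH_LOSSES)
```

Since every epoch improves on the previous one, the counter of non-improving epochs stays at zero throughout.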
---
## Usage
This model is intended to be used with [`code_search.py`](https://github.com/MatthewsO3/GraphCode-CErl-base/tree/main/Code%20Search), a unified indexing and search tool included in the repository.
### Quick start
```bash
git clone https://github.com/MatthewsO3/GraphCode-CErl-base
cd "GraphCode-CErl-base/Code Search/Evaluation"
python setup.py # creates .venv, installs deps, builds erlang.so
source .venv/bin/activate

# Index a repository (auto-discovers Erlang + C++ + Python)
python code_search.py index \
    --repo /path/to/your/repo \
    --model MatthewsO3/GraphCode-CErl-codesearch \
    --output corpus.jsonl \
    --index corpus_index.pt

# Search interactively
python code_search.py search \
    --model MatthewsO3/GraphCode-CErl-codesearch \
    --jsonl corpus.jsonl \
    --index corpus_index.pt \
    --top 5
```
Language-specific flags are also available and can be combined freely:
```bash
# Erlang only
python code_search.py index --erlang /path/to/erl_repo ...
# C++ only
python code_search.py index --cpp /path/to/cpp_repo ...
# Explicit mix
python code_search.py index --erlang /path/erl --cpp /path/cpp --python /path/py ...
```
### Using the model directly
```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("microsoft/graphcodebert-base")
model = AutoModel.from_pretrained("MatthewsO3/GraphCode-CErl-codesearch")
model.eval()

def encode(texts):
    enc = tokenizer(texts, return_tensors="pt", truncation=True,
                    padding=True, max_length=256)
    with torch.no_grad():
        out = model(**enc)
    # Mean pooling over non-padding tokens
    mask = enc["attention_mask"].unsqueeze(-1).float()
    emb = (out.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
    # L2-normalize so the dot product below is a cosine similarity
    return emb / emb.norm(dim=1, keepdim=True)

query = encode(["handle TCP connection timeout"])
code = encode(["handle_timeout(Socket, State) -> gen_tcp:close(Socket), {stop, timeout, State}."])
score = (query @ code.T).item()
print(f"Similarity: {score:.4f}")
```
> **Note:** The tokenizer is loaded from `microsoft/graphcodebert-base` since it is identical to the fine-tuned model's tokenizer and avoids a redundant download.
---
## Supported Languages
| Language | Extractor | Extensions |
|----------|-----------|------------|
| Erlang | tree-sitter (WhatsApp grammar) + custom `ErlangParser` + regex fallback | `.erl`, `.hrl` |
| C++ | tree-sitter + regex fallback | `.cpp`, `.cc`, `.cxx`, `.c`, `.h`, `.hpp` |
| Python | tree-sitter + regex fallback | `.py` |
> **Note:** Python indexing is supported by `code_search.py` but the model was not trained on Python data. Results for Python queries may be less accurate.
---
## Limitations
- Not trained on Python — cross-language transfer to Python is best-effort
- The Erlang training set is private and not released
- Functions without docstrings or comments are embedded on code tokens alone, which may reduce retrieval accuracy for ambiguous natural language queries
- Running on CPU is fully supported but slow for large corpora at index-build time; a GPU is recommended
---
## Repository
Training code, indexing tool, and setup scripts are available at:
[github.com/MatthewsO3/GraphCode-CErl-base](https://github.com/MatthewsO3/GraphCode-CErl-base)
---
## Citation
If you use this model, please cite the original GraphCodeBERT paper:
```bibtex
@inproceedings{guo2021graphcodebert,
title = {GraphCodeBERT: Pre-training Code Representations with Data Flow},
author = {Guo, Daya and Ren, Shuo and Lu, Shuai and Feng, Zhangyin and Tang, Duyu
and Liu, Shujie and Zhou, Long and Duan, Nan and Svyatkovskiy, Alexey
and Fu, Shengyu and others},
booktitle = {International Conference on Learning Representations},
year = {2021}
}
```