GraphCode-CErl — Semantic Code Search for Erlang & C++

Fine-tuned GraphCodeBERT for semantic code search over Erlang and C++ codebases. Given a natural language query, the model retrieves the most semantically relevant functions from an indexed repository.

Model Description

This is a bi-encoder trained with contrastive learning. It encodes both natural language queries and code snippets into a shared embedding space, enabling efficient cosine-similarity-based retrieval at search time.

Base model: microsoft/graphcodebert-base
Architecture: GraphCodeBERT encoder with mean pooling + L2 normalization (no LM head)
Languages trained on: Erlang, C++
Task: Semantic code search / function retrieval

Architecture detail

The model wraps the GraphCodeBERT encoder in a lightweight CodeSearchModel:

import torch

# Mean pooling over all non-padding token positions (not just CLS)
def mean_pooling(last_hidden_state, attention_mask):
    # Broadcast the (batch, seq) mask to (batch, seq, hidden) so padding contributes zero
    mask = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()
    return torch.sum(last_hidden_state * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)

Embeddings are L2-normalized, so retrieval is a plain dot product (equivalent to cosine similarity).
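The equivalence between dot product and cosine similarity can be checked by hand. The sketch below uses plain Python lists and made-up 3-dimensional vectors in place of real embeddings; the helper names are illustrative and not part of the repository.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine(a, b):
    """Cosine similarity computed from the raw, unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors, made up for illustration only.
query_vec = [1.0, 2.0, 2.0]
code_vec = [2.0, 0.0, 1.0]

cos_sim = cosine(query_vec, code_vec)
dot_sim = dot(l2_normalize(query_vec), l2_normalize(code_vec))
print(abs(cos_sim - dot_sim) < 1e-12)
```

Because normalization happens once at encoding time, the search step only needs a matrix product between the query vector and the stored index.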

Training

Data

Training records were constructed from two sources:

Language	Source	Records
C++	`codeparrot/xlcost-text-to-code` (C++-program-level)	8,650
Erlang	Private dataset (not released)	—

Each record is a (code, good_docstring, bad1_docstring, bad2_docstring) tuple. Negatives were mined as follows:

60% hard negatives — BM25-retrieved docstrings that are lexically similar to the positive but semantically wrong (top-20 BM25 candidates, sampled randomly)
30% cross-language negatives — docstrings sampled from the opposite language to discourage language-specific shortcuts
10% random negatives — uniform random docstrings as easy negatives
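The 60/30/10 mix can be sketched as a weighted choice over negative types. This is only an illustration: the function name, the pool arguments, and the choice to draw the type independently for each negative are assumptions, and the actual mining script may differ.

```python
import random

# Mixing weights taken from the percentages listed above.
NEGATIVE_MIX = {"hard_bm25": 0.6, "cross_language": 0.3, "random": 0.1}


def sample_negative(bm25_top20, other_language_pool, all_docstrings, rng):
    """Pick a negative type by weight, then draw one docstring of that type.

    All three pools are hypothetical inputs: lists of docstrings prepared
    elsewhere (for example, bm25_top20 with the positive already removed).
    """
    kinds = list(NEGATIVE_MIX)
    weights = [NEGATIVE_MIX[kind] for kind in kinds]
    kind = rng.choices(kinds, weights=weights, k=1)[0]
    pool = {
        "hard_bm25": bm25_top20,
        "cross_language": other_language_pool,
        "random": all_docstrings,
    }[kind]
    return kind, rng.choice(pool)


rng = random.Random(42)  # fixed seed so the draw is reproducible
kind, negative = sample_negative(
    bm25_top20=["close a TCP socket", "open a TCP socket"],
    other_language_pool=["sort a vector in place"],
    all_docstrings=["compute a checksum", "parse a config file"],
    rng=rng,
)
print(kind, "->", negative)
```

Calling the function twice per record would yield the bad1_docstring and bad2_docstring fields, assuming both negatives are drawn from the same mix.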

Loss

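Temperature-scaled cross-entropy over augmented scores. For each batch, the in-batch score matrix (good_scores) is extended column-wise with the scores against both negatives, and the target for row i is column i, the matching positive: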

augmented_scores = [good_scores | bad1_scores | bad2_scores]
loss = CrossEntropyLoss(augmented_scores / τ, diagonal_labels)

where τ = 0.05.
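The pseudocode can be unpacked into a small runnable sketch. It is written in plain Python for readability rather than with tensors, and it makes two assumptions the card does not confirm: rows are code embeddings, and each of the three score blocks is a full batch-by-batch matrix, giving 3 × batch candidate columns per row. The function name is hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def augmented_contrastive_loss(code_emb, good_emb, bad1_emb, bad2_emb, tau=0.05):
    """Mean cross-entropy where row i must pick column i among all candidates."""
    # Column layout mirrors [good_scores | bad1_scores | bad2_scores].
    candidates = good_emb + bad1_emb + bad2_emb
    total = 0.0
    for i, anchor in enumerate(code_emb):
        logits = [dot(anchor, cand) / tau for cand in candidates]
        # Numerically stable log-sum-exp over the row.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the diagonal label: -log softmax(logits)[i].
        total += log_z - logits[i]
    return total / len(code_emb)


# Toy unit-length embeddings, made up for illustration.
code = [[1.0, 0.0], [0.0, 1.0]]
good = [[1.0, 0.0], [0.0, 1.0]]
bad = [[-1.0, 0.0], [0.0, -1.0]]
print(augmented_contrastive_loss(code, good, bad, bad))
```

With τ = 0.05 the logits span a range of ±20 for unit vectors, so the loss is dominated by the hardest candidates in each row.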

Hyperparameters

Parameter	Value
Base model	`microsoft/graphcodebert-base`
Batch size	32
Epochs	10
Learning rate	2e-5
LR schedule	Linear warmup (10%) → linear decay to 0
Optimizer	AdamW
Gradient clipping	1.0
Code max length	256 tokens
NL max length	128 tokens
Temperature (τ)	0.05
Early stopping patience	3 (not triggered)
Seed	42
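The learning-rate schedule in the table can be written as a function of the optimizer step. The sketch below follows the usual linear warmup and linear decay shape (the same shape as the `get_linear_schedule_with_warmup` helper in transformers); whether training used that helper is not stated here, and the function name is illustrative.

```python
def lr_at_step(step, total_steps, base_lr=2e-5, warmup_frac=0.10):
    """Learning rate at a given optimizer step.

    Rises linearly from 0 to base_lr over the first warmup_frac of training,
    then falls linearly back to 0 at total_steps.
    """
    warmup_steps = round(total_steps * warmup_frac)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    decay_steps = max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / decay_steps)


# Example with a hypothetical run of 1,000 optimizer steps.
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step, total_steps=1000))
```

The real total is epochs × batches per epoch, which cannot be derived from this card because the Erlang record count is not published.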

Training curve

Epoch	Loss
1	1.4135
2	0.4685
3	0.3438
4	0.2738
5	0.2308
6	0.1997
7	0.1671
8	0.1507
9	0.1425
10	0.1348 ← best

Training ran for all 10 epochs without triggering early stopping (patience = 3). Best model saved at epoch 10.
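The early-stopping behaviour can be replayed against the reported losses. This is a minimal sketch that assumes the monitored quantity is the loss column above and that any strict decrease counts as an improvement; the actual training loop may track a different metric or use a minimum delta.

```python
# Per-epoch losses copied from the training curve table.
EPOCH_LOSSES = [1.4135, 0.4685, 0.3438, 0.2738, 0.2308,
                0.1997, 0.1671, 0.1507, 0.1425, 0.1348]


def run_early_stopping(losses, patience=3):
    """Return (best_epoch, best_loss, stopped_early) for a sequence of losses."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            # New best: remember it and reset the patience counter.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, best_loss, True
    return best_epoch, best_loss, False


print(run_early_stopping(EPOCH_LOSSES))
```

Since the loss falls at every epoch, the patience counter never leaves zero, which is consistent with the best checkpoint being the last one.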

Usage

This model is intended to be used with code_search.py, a unified indexing and search tool included in the repository.

Quick start

git clone https://github.com/MatthewsO3/GraphCode-CErl-base
cd "GraphCode-CErl-base/Code Search/Evaluation"
python setup.py          # creates .venv, installs deps, builds erlang.so
source .venv/bin/activate

# Index a repository (auto-discovers Erlang + C++ + Python)
python code_search.py index \
    --repo /path/to/your/repo \
    --model MatthewsO3/GraphCode-CErl-codesearch \
    --output corpus.jsonl \
    --index corpus_index.pt

# Search interactively
python code_search.py search \
    --model MatthewsO3/GraphCode-CErl-codesearch \
    --jsonl corpus.jsonl \
    --index corpus_index.pt \
    --top 5

Language-specific flags are also available and can be combined freely:

# Erlang only
python code_search.py index --erlang /path/to/erl_repo ...

# C++ only
python code_search.py index --cpp /path/to/cpp_repo ...

# Explicit mix
python code_search.py index --erlang /path/erl --cpp /path/cpp --python /path/py ...

Using the model directly

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("microsoft/graphcodebert-base")
model = AutoModel.from_pretrained("MatthewsO3/GraphCode-CErl-codesearch")
model.eval()

def encode(texts, max_length=256):
    # Training used 256 tokens for code and 128 for natural language
    enc = tokenizer(texts, return_tensors="pt", truncation=True,
                    padding=True, max_length=max_length)
    with torch.no_grad():
        out = model(**enc)
    # Mean pooling over non-padding tokens
    mask = enc["attention_mask"].unsqueeze(-1).float()
    emb = (out.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
    # L2-normalize so the dot product below equals cosine similarity
    return torch.nn.functional.normalize(emb, p=2, dim=1)

query = encode(["handle TCP connection timeout"], max_length=128)
code  = encode(["handle_timeout(Socket, State) -> gen_tcp:close(Socket), {stop, timeout, State}."])

score = (query @ code.T).item()
print(f"Similarity: {score:.4f}")

Note: The tokenizer is loaded from microsoft/graphcodebert-base since it is identical to the fine-tuned model's tokenizer and avoids a redundant download.
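Scoring one pair generalizes to ranking a whole corpus: score the query against every stored vector and keep the highest k. The sketch below shows that step with plain Python lists and made-up unit vectors; `top_k` is a hypothetical helper, not the function used by code_search.py, and with tensors the same result would come from a matrix product followed by a top-k selection.

```python
def top_k(query_vec, corpus_vecs, k=5):
    """Return up to k (index, score) pairs, highest dot-product score first.

    Assumes every vector is already L2-normalized, so the dot product
    is the cosine similarity.
    """
    scored = [
        (index, sum(q * c for q, c in zip(query_vec, vec)))
        for index, vec in enumerate(corpus_vecs)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy unit-length vectors standing in for indexed function embeddings.
corpus = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
results = top_k([1.0, 0.0], corpus, k=2)
print(results)
```

The returned indices would then be mapped back to the function records stored alongside the index.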

Supported Languages

Language	Extractor	Extensions
Erlang	tree-sitter (WhatsApp grammar) + custom `ErlangParser` + regex fallback	`.erl`, `.hrl`
C++	tree-sitter + regex fallback	`.cpp`, `.cc`, `.cxx`, `.c`, `.h`, `.hpp`
Python	tree-sitter + regex fallback	`.py`

Note: code_search.py can index Python, but the model was not trained on Python data, so retrieval quality on Python code may be lower.
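The extension column of the table maps directly onto a lookup from file suffix to language. The sketch below is one way to express it; the dictionary and the `detect_language` helper are illustrative and are not taken from code_search.py, which may discover files differently.

```python
from pathlib import Path

# Suffix-to-language mapping copied from the Supported Languages table.
EXTENSION_TO_LANGUAGE = {
    ".erl": "erlang", ".hrl": "erlang",
    ".cpp": "cpp", ".cc": "cpp", ".cxx": "cpp",
    ".c": "cpp", ".h": "cpp", ".hpp": "cpp",
    ".py": "python",
}


def detect_language(path):
    """Return the language for a file path, or None if the suffix is unsupported."""
    return EXTENSION_TO_LANGUAGE.get(Path(path).suffix.lower())


for name in ("src/server.erl", "include/queue.hpp", "tools/build.py", "README.md"):
    print(name, "->", detect_language(name))
```

Note that `.c` and `.h` files are routed to the C++ extractor, as in the table.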

Limitations

Not trained on Python — cross-language transfer to Python is best-effort
The Erlang training set is private and not released
Functions without docstrings or comments are embedded on code tokens alone, which may reduce retrieval accuracy for ambiguous natural language queries
Running on CPU is fully supported but slow for large corpora at index-build time; a GPU is recommended

Repository

Training code, indexing tool, and setup scripts are available at: github.com/MatthewsO3/GraphCode-CErl-base

Citation

If you use this model, please cite the original GraphCodeBERT paper:

@inproceedings{guo2021graphcodebert,
  title     = {GraphCodeBERT: Pre-training Code Representations with Data Flow},
  author    = {Guo, Daya and Ren, Shuo and Lu, Shuai and Feng, Zhangyin and Tang, Duyu
               and Liu, Shujie and Zhou, Long and Duan, Nan and Svyatkovskiy, Alexey
               and Fu, Shengyu and others},
  booktitle = {International Conference on Learning Representations},
  year      = {2021}
}
