---
library_name: transformers
license: cc0-1.0
datasets:
- codemetic/curve
language:
- en
metrics:
- code_eval
base_model:
- microsoft/graphcodebert-base
---

# CWEBERT

This is the pretrained CWEBERT, built on [GraphCodeBERT](https://huggingface.co/microsoft/graphcodebert-base) and further pretrained with masked language modeling (MLM) for 3 epochs on a [C/C++ code corpus of roughly 5M samples](https://huggingface.co/datasets/codemetic/curve/viewer/pretrain).
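
Below is a rough sketch of how this MLM pretraining could be reproduced with the `transformers` Trainer. The dataset config/split layout, the column name `code`, and the hyperparameters (batch size, learning rate, masking rate) are assumptions for illustration, not the actual training settings.

```python
from datasets import load_dataset
from transformers import (
    RobertaTokenizer,
    RobertaForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from the GraphCodeBERT checkpoint.
tokenizer = RobertaTokenizer.from_pretrained("microsoft/graphcodebert-base")
model = RobertaForMaskedLM.from_pretrained("microsoft/graphcodebert-base")

# Assumed config/split layout and column name for the pretraining corpus.
dataset = load_dataset("codemetic/curve", "pretrain", split="train")

def tokenize(batch):
    return tokenizer(batch["code"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# Standard dynamic masking: 15% of tokens are masked each step.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="cwebert-mlm",
    num_train_epochs=3,             # matches the 3 epochs stated above
    per_device_train_batch_size=8,  # assumed
    learning_rate=5e-5,             # assumed
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
trainer.save_model("cwebert-mlm")
```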

## Getting Started

```python
import torch
from transformers import RobertaTokenizer, RobertaForMaskedLM
import torch.nn.functional as F

# This script checks the MLM pretraining quality on an actual code snippet.
# `CODE_TO_TEST` below provides an example, with <mask> marking the masked token.
# The script predicts the masked token and prints the corresponding confidence scores.

# -----------------------------
# Input example: a C/C++ snippet.
# Replace it with your own test code.
# Note: use <mask> for the token(s) you want the model to predict.
# -----------------------------
CODE_TO_TEST = """
int parseHeader(const std::<mask><int>& header, int index) {
    int size = header.size();
    int len = header[0];
    if (len > size) {
        return -1;
    }
    int pos = index + len;          
    return header[pos];
}
"""

# -----------------------------
# Configuration
# -----------------------------
MODEL_DIR = "cwebert-mlm"  # local path or Hub repo id of the pretrained CWEBERT MLM checkpoint
TOP_K_TO_PREDICT = 5
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", DEVICE)

# -----------------------------
# Load model and tokenizer
# -----------------------------
tokenizer = RobertaTokenizer.from_pretrained(MODEL_DIR)
model = RobertaForMaskedLM.from_pretrained(MODEL_DIR).to(DEVICE)
model.eval()

masked_text = CODE_TO_TEST.replace("<mask>", tokenizer.mask_token)
inputs = tokenizer(masked_text, return_tensors="pt")
input_ids = inputs["input_ids"].to(DEVICE)

# -----------------------------
# Find mask position
# -----------------------------
mask_token_index = torch.where(input_ids == tokenizer.mask_token_id)[1]

# -----------------------------
# Inference
# -----------------------------
with torch.no_grad():
    logits = model(input_ids).logits

# Get logits at mask position
mask_logits = logits[0, mask_token_index, :][0]

# Apply softmax to get probabilities
probs = F.softmax(mask_logits, dim=-1)

# Get top-k predictions
top_probs, top_indices = torch.topk(probs, TOP_K_TO_PREDICT)

print("Top predictions for <mask>:")
for token_id, prob in zip(top_indices, top_probs):
    token_str = tokenizer.decode([token_id])
    print(f"{token_str:20s}  prob={prob.item():.6f}")

# -----------------------------
# Construct replaced code
# -----------------------------
best_token = tokenizer.decode([top_indices[0]])
predicted_code = masked_text.replace(tokenizer.mask_token, best_token)

print("\nPredicted most probably Code:\n", predicted_code)
```

## Downstream fine-tuning

You can fine-tune this pretrained cwebert-mlm checkpoint for your downstream tasks; a minimal sketch follows.
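
The example below sketches fine-tuning for a binary sequence-classification task (e.g. vulnerable / not vulnerable) on top of the pretrained encoder. The task, the data file, and the column names `func` and `label` are placeholders for illustration, not part of this model card.

```python
from datasets import load_dataset
from transformers import (
    RobertaTokenizer,
    RobertaForSequenceClassification,
    Trainer,
    TrainingArguments,
)

MODEL_DIR = "cwebert-mlm"

tokenizer = RobertaTokenizer.from_pretrained(MODEL_DIR)
# A fresh classification head is added on top of the pretrained encoder.
model = RobertaForSequenceClassification.from_pretrained(MODEL_DIR, num_labels=2)

# Placeholder data: a JSON Lines file with a code column "func" and an integer "label".
dataset = load_dataset("json", data_files="train.jsonl", split="train")

def tokenize(batch):
    return tokenizer(batch["func"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="cwebert-finetuned",
    num_train_epochs=3,             # assumed
    per_device_train_batch_size=8,  # assumed
    learning_rate=2e-5,             # assumed
)

trainer = Trainer(model=model, args=args, train_dataset=tokenized)
trainer.train()
trainer.save_model("cwebert-finetuned")
```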