codemetic committed · Commit b8f0586 · 1 Parent(s): 54e46de

Update README.md

Files changed (1): README.md +86 -1
# CWEBERT

This is the pretrained CWEBERT, based on [GraphCodeBERT](https://huggingface.co/microsoft/graphcodebert-base) and further pretrained with masked language modeling (MLM) on a [C/C++ code corpus of about 5M samples](https://huggingface.co/datasets/codemetic/curve/viewer/pretrain).
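
The MLM objective behind this pretraining can be illustrated with a toy example (assumed numbers, not the real model or vocabulary): at a masked position the model produces one logit per vocabulary token, and training minimizes the cross-entropy between the softmax over those logits and the original token.

```python
import torch
import torch.nn.functional as F

# Toy illustration of the MLM objective (hypothetical vocabulary of 10 tokens;
# the real model uses the GraphCodeBERT vocabulary and a transformer encoder).
vocab_size = 10
logits = torch.zeros(vocab_size)
logits[7] = 5.0            # the model strongly favors token id 7 at the mask
target = torch.tensor(7)   # the original (masked-out) token id

# Cross-entropy at the masked position is the training loss
loss = F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))
probs = F.softmax(logits, dim=-1)
print(f"P(correct token) = {probs[7].item():.3f}, MLM loss = {loss.item():.3f}")
```

A confident, correct prediction drives the loss toward zero; averaging this loss over many masked positions in the 5M-sample corpus is what the pretraining optimizes.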

## Getting Started

```python
import torch
import torch.nn.functional as F
from transformers import RobertaTokenizer, RobertaForMaskedLM

# This script tests the MLM pretraining effect on an actual code snippet.
# CODE_TO_TEST below provides an example that uses <mask> to hide a code token;
# the model predicts the masked token and prints confidence scores.

# -----------------------------
# Input example: a C/C++ snippet.
# You can put your own test code here.
# Note: use <mask> to replace the tokens to be masked.
# -----------------------------
CODE_TO_TEST = """
int parseHeader(const std::<mask><int>& header, int index) {
    int size = header.size();
    int len = header[0];
    if (len > size) {
        return -1;
    }
    int pos = index + len;
    return header[pos];
}
"""

# -----------------------------
# Configuration
# -----------------------------
MODEL_DIR = "cwebert-mlm"
TOP_K_TO_PREDICT = 5
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", DEVICE)

# -----------------------------
# Load model and tokenizer
# -----------------------------
tokenizer = RobertaTokenizer.from_pretrained(MODEL_DIR)
model = RobertaForMaskedLM.from_pretrained(MODEL_DIR).to(DEVICE)
model.eval()

masked_text = CODE_TO_TEST.replace("<mask>", tokenizer.mask_token)
inputs = tokenizer(masked_text, return_tensors="pt")
input_ids = inputs["input_ids"].to(DEVICE)

# -----------------------------
# Find the mask position
# -----------------------------
mask_token_index = torch.where(input_ids == tokenizer.mask_token_id)[1]

# -----------------------------
# Inference
# -----------------------------
with torch.no_grad():
    logits = model(input_ids).logits

# Logits at the mask position
mask_logits = logits[0, mask_token_index, :][0]

# Apply softmax to get probabilities
probs = F.softmax(mask_logits, dim=-1)

# Get top-k predictions
top_probs, top_indices = torch.topk(probs, TOP_K_TO_PREDICT)

print("Top predictions for <mask>:")
for token_id, prob in zip(top_indices, top_probs):
    token_str = tokenizer.decode([token_id])
    print(f"{token_str:20s} prob={prob.item():.6f}")

# -----------------------------
# Reconstruct the code with the best prediction
# -----------------------------
best_token = tokenizer.decode([top_indices[0]])
predicted_code = masked_text.replace(tokenizer.mask_token, best_token)

print("\nMost probable predicted code:\n", predicted_code)
```

## Downstream fine-tuning

You can fine-tune this pretrained cwebert-mlm model for your own downstream tasks.
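
As a minimal sketch, a typical setup would swap the MLM head for a task head, e.g. `RobertaForSequenceClassification` for snippet-level classification. The config sizes, labels, and batch below are hypothetical stand-ins so the sketch runs offline; in practice you would load the pretrained weights with `RobertaForSequenceClassification.from_pretrained("cwebert-mlm", num_labels=...)` and feed real tokenized code.

```python
import torch
from transformers import RobertaConfig, RobertaForSequenceClassification

torch.manual_seed(0)

# Hypothetical tiny config standing in for the pretrained checkpoint, so this
# sketch runs without downloading weights. For real fine-tuning, replace with:
#   model = RobertaForSequenceClassification.from_pretrained("cwebert-mlm", num_labels=2)
config = RobertaConfig(
    vocab_size=100, hidden_size=32, num_hidden_layers=2,
    num_attention_heads=2, intermediate_size=64,
    max_position_embeddings=64, num_labels=2,
)
model = RobertaForSequenceClassification(config)

# Toy batch: 4 "code snippets" of 16 token ids each, with binary labels
input_ids = torch.randint(3, 100, (4, 16))
labels = torch.tensor([0, 1, 0, 1])

# One fine-tuning step: forward pass with labels yields the loss directly
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
outputs = model(input_ids=input_ids, labels=labels)
outputs.loss.backward()
optimizer.step()

print("loss:", outputs.loss.item())
print("logits shape:", tuple(outputs.logits.shape))
```

The same pattern extends to token-level tasks (`RobertaForTokenClassification`) or can be wrapped in the `Trainer` API for full training loops.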