Shuu12121 committed
Commit bb9454a · verified · 1 Parent(s): e4e7457

Update README.md

Files changed (1): README.md (+116 −46)
README.md CHANGED
@@ -29,75 +29,145 @@ tags:
  - go
  ---

- # **🦉 CodeModernBERT-Owl v1.0: A High-Accuracy Code Search & Code Understanding Model**

- **CodeModernBERT-Owl v1.0** is a **pretrained model** designed from scratch for **code search and code understanding tasks**.

- This model **now supports Rust** and **improves search accuracy** in Python, PHP, Java, JavaScript, Go, and Ruby.

- ## **🛠️ 主な特徴 / Key Features**
-
- - **Supports long sequences up to 8192 tokens** (training used up to 2048)
- - **Optimized for code search, code understanding, and code clone detection**
- - **Achieves top-tier performance across multiple languages**
- - **Multi-language support**: Python, PHP, Java, JavaScript, Go, Ruby, and Rust
- - **Mean pooling performs significantly better than the CLS token with this model**

  ---

  ## **📊 モデルパラメータ / Model Parameters**

- | パラメータ / Parameter | 値 / Value |
- | ------------------------------- | ------------------------- |
- | vocab_size | 50,004 |
- | hidden_size | 768 |
- | num_hidden_layers | 12 |
- | num_attention_heads | 12 |
- | intermediate_size | 3,072 |
- | max_position_embeddings | 8,192 (trained with 2048) |
- | type_vocab_size | 2 |
- | hidden_dropout_prob | 0.1 |
- | attention_probs_dropout_prob | 0.1 |
- | local_attention_window | 128 |
- | rope_theta | 160,000 |
- | local_attention_rope_theta | 10,000 |

  ---

- ## **📊 言語別 MRR 比較 / MRR Comparison by Language (Mean Pooling)**

- - Experiments were run on the CodeSearchNet test split.
- - The candidate pool size was fixed at 100, and performance was measured for each language.

- | 言語 / Language | **CodeModernBERT-Owl-1.0** | CodeT5+ | GraphCodeBERT | CodeBERTa-small | CodeBERT |
- | ------------- | ---------------------- | ------- | ------------- | --------------- | -------- |
- | Python | **0.8936** | 0.8048 | 0.3496 | 0.6123 | 0.0927 |
- | Java | **0.8479** | 0.7853 | 0.3299 | 0.4738 | 0.0816 |
- | JavaScript | **0.7711** | 0.7111 | 0.2581 | 0.3593 | 0.0692 |
- | PHP | **0.8056** | 0.7893 | 0.2507 | 0.4533 | 0.0623 |
- | Ruby | **0.7993** | 0.7201 | 0.3186 | 0.4418 | 0.0762 |
- | Go | **0.8426** | 0.7577 | 0.4453 | 0.5338 | 0.0856 |

- CodeModernBERT-Owl-1.0 (Mean Pooling) achieves the best MRR across all evaluated languages.

  ---

- ## **📝 結論 / Conclusion**

- - **Top performance in all languages**
- - **Rust support successfully added through dataset augmentation**
- - **Mean pooling is significantly more effective than CLS embedding**
- - **Further performance improvements possible with better datasets**

  ---

  ## **📜 ライセンス / License**
-
- 📄 Apache-2.0

  ## **📧 連絡先 / Contact**
-
- 📩 For any questions, please contact:
- 📧 [shun0212114@outlook.jp](mailto:shun0212114@outlook.jp)

  - go
  ---

+ # **🦉 CodeModernBERT-Owl**

+ ## **概要 / Overview**

+ ### **🦉 CodeModernBERT-Owl: A High-Accuracy Code Search & Code Understanding Model**
+ **CodeModernBERT-Owl** is a **pretrained model** designed from scratch for **code search and code understanding tasks**.

+ Compared to previous versions such as **CodeHawks-ModernBERT** and **CodeMorph-ModernBERT**, this model **now supports Rust** and **improves search accuracy** in Python, PHP, Java, JavaScript, Go, and Ruby.

+ ### **🛠 主な特徴 / Key Features**
+ ✅ **Supports long sequences up to 2048 tokens** (compared to Microsoft's 512-token models)
+ ✅ **Optimized for code search, code understanding, and code clone detection**
+ ✅ **Fine-tuned on GitHub open-source repositories (Java, Rust)**
+ ✅ **Achieves the highest accuracy among the CodeHawks/CodeMorph series**
+ ✅ **Multi-language support**: **Python, PHP, Java, JavaScript, Go, Ruby, and Rust**
 
  ---

  ## **📊 モデルパラメータ / Model Parameters**
+ | パラメータ / Parameter | 値 / Value |
+ |-------------------------|------------|
+ | **vocab_size** | 50,004 |
+ | **hidden_size** | 768 |
+ | **num_hidden_layers** | 12 |
+ | **num_attention_heads** | 12 |
+ | **intermediate_size** | 3,072 |
+ | **max_position_embeddings** | 2,048 |
+ | **type_vocab_size** | 2 |
+ | **hidden_dropout_prob** | 0.1 |
+ | **attention_probs_dropout_prob** | 0.1 |
+ | **local_attention_window** | 128 |
+ | **rope_theta** | 160,000 |
+ | **local_attention_rope_theta** | 10,000 |
+
+ ---
+
+ ## **💻 モデルの使用方法 / How to Use**
+ This model can be loaded easily with the **Hugging Face Transformers** library.
+
+ ⚠️ **Requires transformers >= 4.48.0**
+
+ 🔗 **[Colab Demo](https://github.com/Shun0212/CodeBERTPretrained/blob/main/UseMyCodeMorph_ModernBERT.ipynb)** (replace the model name in the notebook with "CodeModernBERT-Owl")
+
+ ### **モデルのロード / Load the Model**
+ ```python
+ from transformers import AutoModelForMaskedLM, AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Owl")
+ model = AutoModelForMaskedLM.from_pretrained("Shuu12121/CodeModernBERT-Owl")
+ ```
+
+ ### **コード埋め込みの取得 / Get Code Embeddings**
+ ```python
+ import torch
+
+ def get_embedding(text, model, tokenizer, device="cuda"):
+     # ModernBERT does not use token_type_ids, so drop them if present
+     inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
+     inputs.pop("token_type_ids", None)
+     inputs = {k: v.to(device) for k, v in inputs.items()}
+     # Run the backbone encoder and take the first-token (CLS) embedding
+     outputs = model.model(**inputs)
+     embedding = outputs.last_hidden_state[:, 0, :]
+     return embedding
+
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ model = model.to(device)
+ embedding = get_embedding("def my_function(): pass", model, tokenizer, device=device)
+ print(embedding.shape)  # torch.Size([1, 768])
+ ```
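Once embeddings are available, code search reduces to ranking candidates by cosine similarity against a query embedding. The sketch below is a hypothetical illustration of that ranking step only: random vectors stand in for real `get_embedding` outputs (768 matches the model's hidden_size).

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch: random vectors stand in for embeddings that
# get_embedding would produce for a query and candidate code snippets.
torch.manual_seed(0)
query_emb = torch.randn(1, 768)       # embedding of a natural-language query
candidate_embs = torch.randn(3, 768)  # embeddings of three candidate snippets

# Cosine similarity of the query against every candidate, then rank descending
scores = F.cosine_similarity(query_emb, candidate_embs)  # shape: (3,)
ranking = torch.argsort(scores, descending=True)
print(ranking)
```

In a real pipeline, the candidate embeddings would be precomputed once over the code corpus and only the query embedding computed at search time.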
  ---

+ # **🔍 評価結果 / Evaluation Results**

+ ### **データセット / Dataset**
+ 📌 **Tested on code_x_glue_ct_code_to_text with a candidate pool size of 100.**
+ 📌 **Rust-specific evaluations were conducted using Shuu12121/rust-codesearch-dataset-open.**

+ ---

+ ## **📈 主要な評価指標の比較(同一シード値)/ Key Evaluation Metrics (Same Seed)**
+ | 言語 / Language | **CodeModernBERT-Owl** | **CodeHawks-ModernBERT** | **Salesforce CodeT5+** | **Microsoft CodeBERT** | **GraphCodeBERT** |
+ |-----------|-----------------|----------------------|-----------------|------------------|------------------|
+ | **Python** | **0.8793** | 0.8551 | 0.8266 | 0.5243 | 0.5493 |
+ | **Java** | **0.8880** | 0.7971 | 0.8867 | 0.3134 | 0.5879 |
+ | **JavaScript** | **0.8423** | 0.7634 | 0.7628 | 0.2694 | 0.5051 |
+ | **PHP** | **0.9129** | 0.8578 | 0.9027 | 0.2642 | 0.6225 |
+ | **Ruby** | **0.8038** | 0.7469 | 0.7568 | 0.3318 | 0.5876 |
+ | **Go** | **0.9386** | 0.9043 | 0.8117 | 0.3262 | 0.4243 |

+ ✅ **Achieves the highest accuracy in all target languages.**
+ ✅ **Significantly improved Java accuracy using additional fine-tuned GitHub data.**
+ ✅ **Outperforms previous models, especially in PHP and Go.**

  ---

+ ## **📊 Rust(独自データセット)/ Rust Performance (Custom Dataset)**
+ | 指標 / Metric | **CodeModernBERT-Owl** |
+ |--------------|----------------|
+ | **MRR** | 0.7940 |
+ | **MAP** | 0.7940 |
+ | **R-Precision** | 0.7173 |
+
+ ### **📌 K別評価指標 / Evaluation Metrics by K**
+ | K | **Recall@K** | **Precision@K** | **NDCG@K** | **F1@K** | **Success Rate@K** | **Query Coverage@K** |
+ |----|-------------|---------------|------------|--------|-----------------|-----------------|
+ | **1** | 0.7173 | 0.7173 | 0.7173 | 0.7173 | 0.7173 | 0.7173 |
+ | **5** | 0.8913 | 0.7852 | 0.8118 | 0.8132 | 0.8913 | 0.8913 |
+ | **10** | 0.9333 | 0.7908 | 0.8254 | 0.8230 | 0.9333 | 0.9333 |
+ | **50** | 0.9887 | 0.7938 | 0.8383 | 0.8288 | 0.9887 | 0.9887 |
+ | **100** | 1.0000 | 0.7940 | 0.8401 | 0.8291 | 1.0000 | 1.0000 |
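For reference, headline retrieval metrics like those in the tables above are computed from per-query ranks alone. A minimal sketch of MRR and Recall@K, assuming `ranks` holds the 1-based rank of the correct snippet for each query within the candidate pool (the values below are made-up examples, not the evaluation data):

```python
# Minimal sketch of the retrieval metrics; the `ranks` values are illustrative only.
def mean_reciprocal_rank(ranks):
    """MRR: average of 1/rank of the correct candidate over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def recall_at_k(ranks, k):
    """Recall@K: fraction of queries whose correct candidate appears in the top K."""
    return sum(r <= k for r in ranks) / len(ranks)

ranks = [1, 3, 1, 8, 2]  # hypothetical 1-based ranks for five queries
print(mean_reciprocal_rank(ranks))  # ≈ 0.5917
print(recall_at_k(ranks, 5))        # 0.8
```

With a single correct answer per query, as here, MAP coincides with MRR, which is why the Rust table reports the same value for both.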

+ ---
+
+ ## **🔁 別のおすすめモデル / Recommended Alternative Models**
+
+ ### 1. **[CodeSearch-ModernBERT-Owl🦉](https://huggingface.co/Shuu12121/CodeSearch-ModernBERT-Owl)**
+ If you need a model that is **more specialized for code search**, this model is highly recommended.
+ コードサーチに**特化したモデルが必要な場合**はこちらがおすすめです。
+
+ ### 2. **[CodeModernBERT-Snake🐍](https://huggingface.co/Shuu12121/CodeModernBERT-Snake)**
+ If you need a pretrained model that supports **longer sequences or a smaller model size**, this model is ideal.
+ **シーケンス長が長い**、または**モデルサイズが小さい**事前学習済みモデルが必要な場合はこちらをおすすめします。
+ - **Maximum Sequence Length:** 8192 tokens
+ - **Smaller Model Size:** ~75M parameters
+
+ ### 3. **[CodeSearch-ModernBERT-Snake🐍](https://huggingface.co/Shuu12121/CodeSearch-ModernBERT-Snake)**
+ For a model that combines **long sequence length with code-search specialization**, this model is the best choice.
+ **コードサーチに特化しつつ長いシーケンスを処理できるモデル**が欲しい場合にはこちらがおすすめです。
+ - **Maximum Sequence Length:** 8192 tokens
+ - **High Code Search Performance**
+
+ ## **📝 結論 / Conclusion**
+ ✅ **Top performance in all languages**
+ ✅ **Rust support successfully added through dataset augmentation**
+ ✅ **Further performance improvements possible with better datasets**

  ---

  ## **📜 ライセンス / License**
+ 📄 **Apache-2.0**
 

  ## **📧 連絡先 / Contact**
+ 📩 **For any questions, please contact:**
+ 📧 **[shun0212114@outlook.jp](mailto:shun0212114@outlook.jp)**