Shuu12121 committed on
Commit c6b9f91 · verified · 1 parent: a0281ec

Update README.md

Files changed (1): README.md (+33 −17)
README.md CHANGED
@@ -30,17 +30,15 @@ tags:
 ---


- # **CodeModernBERT-Owl🦉**

 ## **概要 / Overview**

- ### **CodeModernBERT-Owl🦉: 高精度なコード検索 & コード理解モデル**
 **CodeModernBERT-Owl** is a **pretrained model** designed from scratch for **code search and code understanding tasks**.

 Compared to previous versions such as **CodeHawks-ModernBERT** and **CodeMorph-ModernBERT**, this model **now supports Rust** and **improves search accuracy** in Python, PHP, Java, JavaScript, Go, and Ruby.

- ---
-
 ### **🛠 主な特徴 / Key Features**
 ✅ **Supports long sequences up to 2048 tokens** (compared to Microsoft's 512-token models)
 ✅ **Optimized for code search, code understanding, and code clone detection**
@@ -71,20 +69,20 @@ Compared to previous versions such as **CodeHawks-ModernBERT** and **CodeMorph-M
 ## **💻 モデルの使用方法 / How to Use**
 This model can be easily loaded using the **Hugging Face Transformers** library.

- ⚠️ **Requires `transformers >= 4.48.0`**

 🔗 **[Colab Demo (Replace with "CodeModernBERT-Owl")](https://github.com/Shun0212/CodeBERTPretrained/blob/main/UseMyCodeMorph_ModernBERT.ipynb)**

 ### **モデルのロード / Load the Model**
- ```python
 from transformers import AutoModelForMaskedLM, AutoTokenizer

 tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Owl")
 model = AutoModelForMaskedLM.from_pretrained("Shuu12121/CodeModernBERT-Owl")
- ```

 ### **コード埋め込みの取得 / Get Code Embeddings**
- ```python
 import torch

 def get_embedding(text, model, tokenizer, device="cuda"):
@@ -98,15 +96,15 @@ def get_embedding(text, model, tokenizer, device="cuda"):

 embedding = get_embedding("def my_function(): pass", model, tokenizer)
 print(embedding.shape)
- ```

 ---

 # **🔍 評価結果 / Evaluation Results**

 ### **データセット / Dataset**
- 📌 **Tested on `code_x_glue_ct_code_to_text` with a candidate pool size of 100.**
- 📌 **Rust-specific evaluations were conducted using `Shuu12121/rust-codesearch-dataset-open`.**

 ---

@@ -120,6 +118,28 @@ print(embedding.shape)
 | **Ruby** | **0.8038** | 0.7469 | **0.7568** | 0.3318 | 0.5876 |
 | **Go** | **0.9386** | 0.9043 | 0.8117 | 0.3262 | 0.4243 |
 ---

 ## **🔁 別のおすすめモデル / Recommended Alternative Models**
@@ -138,15 +158,13 @@ If you need a pretrained model that supports **longer sequences or a smaller mod
 For those looking for a model that combines **long sequence length and code search specialization**, this model is the best choice.
 **コードサーチに特化しつつ長いシーケンスを処理できるモデル**が欲しい場合にはこちらがおすすめです。
 - **Maximum Sequence Length:** 8192 tokens
- - **High Code Search Performance**

- ---

 ## **📝 結論 / Conclusion**
 ✅ **Top performance in all languages**
 ✅ **Rust support successfully added through dataset augmentation**
 ✅ **Further performance improvements possible with better datasets**
- ✅ **Recommended for various code search and understanding tasks**

 ---

@@ -155,6 +173,4 @@ For those looking for a model that combines **long sequence length and code sear

 ## **📧 連絡先 / Contact**
 📩 **For any questions, please contact:**
- 📧 **shun0212114@outlook.jp**
-
- ---
 
 ---


+ # **CodeModernBERT-Owl**

 ## **概要 / Overview**

+ ### **🦉 CodeModernBERT-Owl: 高精度なコード検索 & コード理解モデル / High-Accuracy Code Search & Code Understanding Model**
 **CodeModernBERT-Owl** is a **pretrained model** designed from scratch for **code search and code understanding tasks**.

 Compared to previous versions such as **CodeHawks-ModernBERT** and **CodeMorph-ModernBERT**, this model **now supports Rust** and **improves search accuracy** in Python, PHP, Java, JavaScript, Go, and Ruby.


 ### **🛠 主な特徴 / Key Features**
 ✅ **Supports long sequences up to 2048 tokens** (compared to Microsoft's 512-token models)
 ✅ **Optimized for code search, code understanding, and code clone detection**
 
 ## **💻 モデルの使用方法 / How to Use**
 This model can be easily loaded using the **Hugging Face Transformers** library.

+ ⚠️ **Requires `transformers >= 4.48.0`**

 🔗 **[Colab Demo (Replace with "CodeModernBERT-Owl")](https://github.com/Shun0212/CodeBERTPretrained/blob/main/UseMyCodeMorph_ModernBERT.ipynb)**

 ### **モデルのロード / Load the Model**
+ ```python
 from transformers import AutoModelForMaskedLM, AutoTokenizer

 tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Owl")
 model = AutoModelForMaskedLM.from_pretrained("Shuu12121/CodeModernBERT-Owl")
+ ```
 
 ### **コード埋め込みの取得 / Get Code Embeddings**
+ ```python
 import torch

 def get_embedding(text, model, tokenizer, device="cuda"):

 embedding = get_embedding("def my_function(): pass", model, tokenizer)
 print(embedding.shape)
+ ```
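The embeddings returned by `get_embedding` are typically used for code search by ranking candidate functions by cosine similarity to a query embedding. A minimal sketch of that ranking step (pure Python on toy vectors; `rank_candidates` is a hypothetical helper for illustration, and in practice the vectors would come from the model above):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def rank_candidates(query_vec, candidate_vecs):
    # Return candidate indices sorted by descending similarity to the query.
    scores = [cosine_similarity(query_vec, v) for v in candidate_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Toy example: candidate 1 points the same way as the query.
query = [1.0, 0.0, 1.0]
candidates = [[0.0, 1.0, 0.0], [2.0, 0.0, 2.0], [1.0, 1.0, 0.0]]
print(rank_candidates(query, candidates))  # [1, 2, 0]
```

The same ranking loop works unchanged on real embedding tensors once they are converted to flat lists.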

 ---

 # **🔍 評価結果 / Evaluation Results**

 ### **データセット / Dataset**
+ 📌 **Tested on `code_x_glue_ct_code_to_text` with a candidate pool size of 100.**
+ 📌 **Rust-specific evaluations were conducted using `Shuu12121/rust-codesearch-dataset-open`.**

 ---
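A candidate pool of size 100 means each docstring query is matched against its paired function plus 99 distractors drawn from the rest of the corpus. A minimal sketch of how such a pool might be assembled (hypothetical helper name, pure Python; not the evaluation script actually used here):

```python
import random

def build_candidate_pool(query_index, corpus_size, pool_size=100, seed=0):
    # One gold candidate (the code paired with the query) plus
    # pool_size - 1 distractors sampled from the rest of the corpus.
    rng = random.Random(seed)
    distractors = [i for i in range(corpus_size) if i != query_index]
    pool = [query_index] + rng.sample(distractors, pool_size - 1)
    rng.shuffle(pool)
    return pool

pool = build_candidate_pool(query_index=7, corpus_size=1000)
print(len(pool), 7 in pool)  # 100 True
```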
 | **Ruby** | **0.8038** | 0.7469 | **0.7568** | 0.3318 | 0.5876 |
 | **Go** | **0.9386** | 0.9043 | 0.8117 | 0.3262 | 0.4243 |

+ ✅ **Achieves the highest accuracy in all target languages.**
+ ✅ **Significantly improved Java accuracy by fine-tuning on additional GitHub data.**
+ ✅ **Outperforms previous models, especially in PHP and Go.**
+
+ ---
+
+ ## **📊 Rust (独自データセット) / Rust Performance (Custom Dataset)**
+ | 指標 / Metric | **CodeModernBERT-Owl** |
+ |--------------|----------------|
+ | **MRR** | 0.7940 |
+ | **MAP** | 0.7940 |
+ | **R-Precision** | 0.7173 |
+
+ ### **📌 K別評価指標 / Evaluation Metrics by K**
+ | K | **Recall@K** | **Precision@K** | **NDCG@K** | **F1@K** | **Success Rate@K** | **Query Coverage@K** |
+ |----|-------------|---------------|------------|--------|-----------------|-----------------|
+ | **1** | 0.7173 | 0.7173 | 0.7173 | 0.7173 | 0.7173 | 0.7173 |
+ | **5** | 0.8913 | 0.7852 | 0.8118 | 0.8132 | 0.8913 | 0.8913 |
+ | **10** | 0.9333 | 0.7908 | 0.8254 | 0.8230 | 0.9333 | 0.9333 |
+ | **50** | 0.9887 | 0.7938 | 0.8383 | 0.8288 | 0.9887 | 0.9887 |
+ | **100** | 1.0000 | 0.7940 | 0.8401 | 0.8291 | 1.0000 | 1.0000 |
+
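The MRR and Recall@K figures in these tables can be derived from the 1-based rank of the correct candidate in each pool. A compact sketch (pure Python; toy ranks for illustration, not the reported data):

```python
def mrr(gold_ranks):
    # gold_ranks: 1-based rank of the correct candidate for each query.
    return sum(1.0 / r for r in gold_ranks) / len(gold_ranks)

def recall_at_k(gold_ranks, k):
    # Fraction of queries whose correct candidate appears in the top K.
    # With one gold item per query this also equals Success Rate@K.
    return sum(1 for r in gold_ranks if r <= k) / len(gold_ranks)

ranks = [1, 1, 2, 5, 30]
print(round(mrr(ranks), 4))    # 0.5467
print(recall_at_k(ranks, 10))  # 0.8
```

With a pool of 100 and one gold item per query, Recall@100 is 1.0 by construction, as the table shows.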
 ---

 ## **🔁 別のおすすめモデル / Recommended Alternative Models**

 For those looking for a model that combines **long sequence length and code search specialization**, this model is the best choice.
 **コードサーチに特化しつつ長いシーケンスを処理できるモデル**が欲しい場合にはこちらがおすすめです。
 - **Maximum Sequence Length:** 8192 tokens
+ - **High Code Search Performance**


 ## **📝 結論 / Conclusion**
 ✅ **Top performance in all languages**
 ✅ **Rust support successfully added through dataset augmentation**
 ✅ **Further performance improvements possible with better datasets**

 ---


 ## **📧 連絡先 / Contact**
 📩 **For any questions, please contact:**
+ 📧 **shun0212114@outlook.jp**