Commit a0281ec (verified) by Shuu12121 · 1 Parent(s): d403250

Update README.md

  1. README.md +27 -22
README.md CHANGED
@@ -3,6 +3,8 @@ license: apache-2.0
  datasets:
  - Shuu12121/rust-codesearch-dataset-open
  - Shuu12121/java-codesearch-dataset-open
  language:
  - en
  pipeline_tag: sentence-similarity
@@ -28,15 +30,17 @@ tags:
  ---


- # **CodeModernBERT-Owl**

  ## **Overview**

- ### **🦉 CodeModernBERT-Owl: A High-Accuracy Code Search & Code Understanding Model**
  **CodeModernBERT-Owl** is a **pretrained model** designed from scratch for **code search and code understanding tasks**.

  Compared to previous versions such as **CodeHawks-ModernBERT** and **CodeMorph-ModernBERT**, this model **now supports Rust** and **improves search accuracy** in Python, PHP, Java, JavaScript, Go, and Ruby.

  ### **🛠 Key Features**
  ✅ **Supports long sequences up to 2048 tokens** (compared to Microsoft's 512-token models)
  ✅ **Optimized for code search, code understanding, and code clone detection**
@@ -116,27 +120,25 @@ print(embedding.shape)
  | **Ruby** | **0.8038** | 0.7469 | **0.7568** | 0.3318 | 0.5876 |
  | **Go** | **0.9386** | 0.9043 | 0.8117 | 0.3262 | 0.4243 |

- ✅ **Achieves the highest accuracy in all target languages.**
- ✅ **Significantly improved Java accuracy using additional fine-tuned GitHub data.**
- ✅ **Outperforms previous models, especially in PHP and Go.**
-
  ---

- ## **📊 Rust Performance (Custom Dataset)**
- | Metric | **CodeModernBERT-Owl** |
- |--------------|----------------|
- | **MRR** | 0.7940 |
- | **MAP** | 0.7940 |
- | **R-Precision** | 0.7173 |
-
- ### **📌 Evaluation Metrics by K**
- | K | **Recall@K** | **Precision@K** | **NDCG@K** | **F1@K** | **Success Rate@K** | **Query Coverage@K** |
- |----|-------------|---------------|------------|--------|-----------------|-----------------|
- | **1** | 0.7173 | 0.7173 | 0.7173 | 0.7173 | 0.7173 | 0.7173 |
- | **5** | 0.8913 | 0.7852 | 0.8118 | 0.8132 | 0.8913 | 0.8913 |
- | **10** | 0.9333 | 0.7908 | 0.8254 | 0.8230 | 0.9333 | 0.9333 |
- | **50** | 0.9887 | 0.7938 | 0.8383 | 0.8288 | 0.9887 | 0.9887 |
- | **100** | 1.0000 | 0.7940 | 0.8401 | 0.8291 | 1.0000 | 1.0000 |

  ---
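
Note on the Rust table removed in this hunk: MRR equals MAP, and Recall@K equals Success Rate@K at every K, which is what one would expect if each query has exactly one relevant snippet. Under that assumption (it is not stated in the README), the two headline metrics can be sketched as below. The function names and the `ranks` values are hypothetical illustrations, not the author's evaluation code or data.

```python
from typing import Sequence


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Average of 1/rank, where ranks[i] is the 1-based position of the
    single relevant snippet in the results for query i (assumption)."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def recall_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of queries whose relevant snippet appears in the top k.
    With one relevant item per query this is also the success rate at k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical ranks for five queries (illustrative only).
ranks = [1, 3, 1, 12, 2]

mrr = mean_reciprocal_rank(ranks)
recall_1 = recall_at_k(ranks, 1)
recall_5 = recall_at_k(ranks, 5)

print(f"MRR={mrr:.4f} Recall@1={recall_1:.2f} Recall@5={recall_5:.2f}")
```

With a single relevant item per query, average precision for a query reduces to its reciprocal rank, which would explain why the MRR and MAP rows carry the same value.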
 
@@ -144,6 +146,7 @@ print(embedding.shape)
  ✅ **Top performance in all languages**
  ✅ **Rust support successfully added through dataset augmentation**
  ✅ **Further performance improvements possible with better datasets**

  ---

@@ -152,4 +155,6 @@ print(embedding.shape)

  ## **📧 Contact**
  📩 **For any questions, please contact:**
- 📧 **shun0212114@outlook.jp**
 
 
 
  datasets:
  - Shuu12121/rust-codesearch-dataset-open
  - Shuu12121/java-codesearch-dataset-open
+ - code-search-net/code_search_net
+ - google/code_x_glue_ct_code_to_text
  language:
  - en
  pipeline_tag: sentence-similarity
 
  ---


+ # **CodeModernBERT-Owl🦉**

  ## **Overview**

+ ### **CodeModernBERT-Owl🦉: A High-Accuracy Code Search & Code Understanding Model**
  **CodeModernBERT-Owl** is a **pretrained model** designed from scratch for **code search and code understanding tasks**.

  Compared to previous versions such as **CodeHawks-ModernBERT** and **CodeMorph-ModernBERT**, this model **now supports Rust** and **improves search accuracy** in Python, PHP, Java, JavaScript, Go, and Ruby.

+ ---
+
  ### **🛠 Key Features**
  ✅ **Supports long sequences up to 2048 tokens** (compared to Microsoft's 512-token models)
  ✅ **Optimized for code search, code understanding, and code clone detection**
 
  | **Ruby** | **0.8038** | 0.7469 | **0.7568** | 0.3318 | 0.5876 |
  | **Go** | **0.9386** | 0.9043 | 0.8117 | 0.3262 | 0.4243 |

  ---

+ ## **🔁 Recommended Alternative Models**
+
+ ### 1. **CodeSearch-ModernBERT-Owl🦉** (https://huggingface.co/Shuu12121/CodeSearch-ModernBERT-Owl)
+ If you need a model that is **more specialized for code search**, this model is highly recommended.
+
+ ### 2. **CodeModernBERT-Snake🐍** (https://huggingface.co/Shuu12121/CodeModernBERT-Snake)
+ If you need a pretrained model that supports **longer sequences or a smaller model size**, this model is ideal.
+ - **Maximum Sequence Length:** 8192 tokens
+ - **Smaller Model Size:** ~75M parameters
+
+ ### 3. **CodeSearch-ModernBERT-Snake🐍** (https://huggingface.co/Shuu12121/CodeSearch-ModernBERT-Snake)
+ For those looking for a model that combines **long sequence length and code search specialization**, this model is the best choice.
+ - **Maximum Sequence Length:** 8192 tokens
+ - **High Code Search Performance**

  ---

 
  ✅ **Top performance in all languages**
  ✅ **Rust support successfully added through dataset augmentation**
  ✅ **Further performance improvements possible with better datasets**
+ ✅ **Recommended for various code search and understanding tasks**

  ---

 

  ## **📧 Contact**
  📩 **For any questions, please contact:**
+ 📧 **shun0212114@outlook.jp**
+
+ ---