Shuu12121/CodeModernBERT-Owl


Shuu12121 committed on Mar 27, 2025

Commit c6b9f91 (verified) · 1 Parent(s): a0281ec

Update README.md

Files changed (1)

README.md +33 -17

README.md CHANGED

@@ -30,17 +30,15 @@ tags:
 ---
-# **CodeModernBERT-Owl🦉**
 ## **概要 / Overview**
-### **CodeModernBERT-Owl🦉: High-Accuracy Code Search & Code Understanding Model**
 **CodeModernBERT-Owl** is a **pretrained model** designed from scratch for **code search and code understanding tasks**.
 Compared to previous versions such as **CodeHawks-ModernBERT** and **CodeMorph-ModernBERT**, this model **now supports Rust** and **improves search accuracy** in Python, PHP, Java, JavaScript, Go, and Ruby.
----
 ### **🛠 主な特徴 / Key Features**
 ✅ **Supports long sequences up to 2048 tokens** (compared to Microsoft's 512-token models)
 ✅ **Optimized for code search, code understanding, and code clone detection**
@@ -71,20 +69,20 @@ Compared to previous versions such as **CodeHawks-ModernBERT** and **CodeMorph-M
 ## **💻 モデルの使用方法 / How to Use**
 This model can be easily loaded using the **Hugging Face Transformers** library.
-⚠️ **Requires `transformers >= 4.48.0`**
 🔗 **[Colab Demo (Replace with "CodeModernBERT-Owl")](https://github.com/Shun0212/CodeBERTPretrained/blob/main/UseMyCodeMorph_ModernBERT.ipynb)**
 ### **モデルのロード / Load the Model**
-```python
 from transformers import AutoModelForMaskedLM, AutoTokenizer
 tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Owl")
 model = AutoModelForMaskedLM.from_pretrained("Shuu12121/CodeModernBERT-Owl")
-```
 ### **コード埋め込みの取得 / Get Code Embeddings**
-```python
 import torch
 def get_embedding(text, model, tokenizer, device="cuda"):
@@ -98,15 +96,15 @@ def get_embedding(text, model, tokenizer, device="cuda"):
 embedding = get_embedding("def my_function(): pass", model, tokenizer)
 print(embedding.shape)
-```
 ---
 # **🔍 評価結果 / Evaluation Results**
 ### **データセット / Dataset**
-📌 **Tested on `code_x_glue_ct_code_to_text` with a candidate pool size of 100.**
-📌 **Rust-specific evaluations were conducted using `Shuu12121/rust-codesearch-dataset-open`.**
 ---
@@ -120,6 +118,28 @@ print(embedding.shape)
 | **Ruby**       | **0.8038**  | 0.7469  | **0.7568**  | 0.3318  | 0.5876  |
 | **Go**         | **0.9386**  | 0.9043  | 0.8117  | 0.3262  | 0.4243  |
 ---
 ## **🔁 別のおすすめモデル / Recommended Alternative Models**
@@ -138,15 +158,13 @@ If you need a pretrained model that supports **longer sequences or a smaller mod
 For those looking for a model that combines **long sequence length and code search specialization**, this model is the best choice.
 **コードサーチに特化しつつ長いシーケンスを処理できるモデル**が欲しい場合にはこちらがおすすめです。
 - **Maximum Sequence Length:** 8192 tokens
-- **High Code Search Performance**
----
 ## **📝 結論 / Conclusion**
 ✅ **Top performance in all languages**
 ✅ **Rust support successfully added through dataset augmentation**
 ✅ **Further performance improvements possible with better datasets**
-✅ **Recommended for various code search and understanding tasks**
 ---
@@ -155,6 +173,4 @@ For those looking for a model that combines **long sequence length and code sear
 ## **📧 連絡先 / Contact**
 📩 **For any questions, please contact:**
-📧 **shun0212114@outlook.jp**
----

 ---
+# **CodeModernBERT-Owl**
 ## **概要 / Overview**
+### **🦉 CodeModernBERT-Owl: High-Accuracy Code Search & Code Understanding Model**
 **CodeModernBERT-Owl** is a **pretrained model** designed from scratch for **code search and code understanding tasks**.
 Compared to previous versions such as **CodeHawks-ModernBERT** and **CodeMorph-ModernBERT**, this model **now supports Rust** and **improves search accuracy** in Python, PHP, Java, JavaScript, Go, and Ruby.
 ### **🛠 主な特徴 / Key Features**
 ✅ **Supports long sequences up to 2048 tokens** (compared to Microsoft's 512-token models)
 ✅ **Optimized for code search, code understanding, and code clone detection**
 ## **💻 モデルの使用方法 / How to Use**
 This model can be easily loaded using the **Hugging Face Transformers** library.
+⚠️ **Requires transformers >= 4.48.0**
 🔗 **[Colab Demo (Replace with "CodeModernBERT-Owl")](https://github.com/Shun0212/CodeBERTPretrained/blob/main/UseMyCodeMorph_ModernBERT.ipynb)**
 ### **モデルのロード / Load the Model**
+python
 from transformers import AutoModelForMaskedLM, AutoTokenizer
 tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Owl")
 model = AutoModelForMaskedLM.from_pretrained("Shuu12121/CodeModernBERT-Owl")
 ### **コード埋め込みの取得 / Get Code Embeddings**
+python
 import torch
 def get_embedding(text, model, tokenizer, device="cuda"):
 embedding = get_embedding("def my_function(): pass", model, tokenizer)
 print(embedding.shape)
 ---
 # **🔍 評価結果 / Evaluation Results**
 ### **データセット / Dataset**
+📌 **Tested on code_x_glue_ct_code_to_text with a candidate pool size of 100.**
+📌 **Rust-specific evaluations were conducted using Shuu12121/rust-codesearch-dataset-open.**
 ---
 | **Ruby**       | **0.8038**  | 0.7469  | **0.7568**  | 0.3318  | 0.5876  |
 | **Go**         | **0.9386**  | 0.9043  | 0.8117  | 0.3262  | 0.4243  |
+✅ **Achieves the highest accuracy in all target languages.**
+✅ **Significantly improved Java accuracy using additional fine-tuned GitHub data.**
+✅ **Outperforms previous models, especially in PHP and Go.**
+---
+## **📊 Rust (独自データセット) / Rust Performance (Custom Dataset)**
+| 指標 / Metric | **CodeModernBERT-Owl** |
+|--------------|----------------|
+| **MRR**      | 0.7940 |
+| **MAP**      | 0.7940 |
+| **R-Precision** | 0.7173 |
+### **📌 K別評価指標 / Evaluation Metrics by K**
+| K  | **Recall@K** | **Precision@K** | **NDCG@K** | **F1@K** | **Success Rate@K** | **Query Coverage@K** |
+|----|-------------|---------------|------------|--------|-----------------|-----------------|
+| **1**   | 0.7173  | 0.7173  | 0.7173  | 0.7173  | 0.7173  | 0.7173  |
+| **5**   | 0.8913  | 0.7852  | 0.8118  | 0.8132  | 0.8913  | 0.8913  |
+| **10**  | 0.9333  | 0.7908  | 0.8254  | 0.8230  | 0.9333  | 0.9333  |
+| **50**  | 0.9887  | 0.7938  | 0.8383  | 0.8288  | 0.9887  | 0.9887  |
+| **100** | 1.0000  | 0.7940  | 0.8401  | 0.8291  | 1.0000  | 1.0000  |
 ---
 ## **🔁 別のおすすめモデル / Recommended Alternative Models**
 For those looking for a model that combines **long sequence length and code search specialization**, this model is the best choice.
 **コードサーチに特化しつつ長いシーケンスを処理できるモデル**が欲しい場合にはこちらがおすすめです。
 - **Maximum Sequence Length:** 8192 tokens
+- **High Code Search Performance**
 ## **📝 結論 / Conclusion**
 ✅ **Top performance in all languages**
 ✅ **Rust support successfully added through dataset augmentation**
 ✅ **Further performance improvements possible with better datasets**
 ---
 ## **📧 連絡先 / Contact**
 📩 **For any questions, please contact:**
+📧 **shun0212114@outlook.jp**
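
Note on the metrics added in this commit: the README reports MRR and Recall@K over a candidate pool of 100, but the diff does not show how they are computed. The following is a minimal sketch of one common way to do it, assuming each query has exactly one correct candidate and that candidates are ranked by cosine similarity between embeddings. The toy vectors and all function names here are hypothetical and are not taken from the author's evaluation code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_of_target(query_vec, candidate_vecs, target_index):
    """1-based rank of the correct candidate after sorting by similarity."""
    scores = [cosine(query_vec, c) for c in candidate_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order.index(target_index) + 1


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def recall_at_k(ranks, k):
    """Fraction of queries whose correct candidate is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical toy data: 3 query embeddings and a shared pool of 3 candidates.
# In a real run these would be model embeddings and the pool would hold 100 items.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
candidates = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.0]]
targets = [0, 1, 2]  # index of the correct candidate for each query

ranks = [rank_of_target(q, candidates, t) for q, t in zip(queries, targets)]
mrr = mean_reciprocal_rank(ranks)

print("ranks:", ranks)
print("MRR:", round(mrr, 4))
for k in (1, 2, 3):
    print(f"Recall@{k}:", round(recall_at_k(ranks, k), 4))
```

Under the single-correct-candidate assumption, Recall@K and Success Rate@K are the same quantity and MRR equals MAP, which would be consistent with the identical values in the Rust tables above. The source does not state the evaluation protocol, so this reading is an inference rather than a confirmed description.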