Shuu12121/CodeModernBERT-Owl

Shuu12121 committed on Mar 26, 2025

Commit a0281ec (verified) · 1 parent: d403250

Update README.md

Files changed (1): README.md +27 -22

@@ -3,6 +3,8 @@ license: apache-2.0
 datasets:
 - Shuu12121/rust-codesearch-dataset-open
 - Shuu12121/java-codesearch-dataset-open
 language:
 - en
 pipeline_tag: sentence-similarity
@@ -28,15 +30,17 @@ tags:
 ---
-# **CodeModernBERT-Owl**
 ## **概要 / Overview**
-### **🦉 CodeModernBERT-Owl: 高精度なコード検索 & コード理解モデル**
 **CodeModernBERT-Owl** is a **pretrained model** designed from scratch for **code search and code understanding tasks**.
 Compared to previous versions such as **CodeHawks-ModernBERT** and **CodeMorph-ModernBERT**, this model **now supports Rust** and **improves search accuracy** in Python, PHP, Java, JavaScript, Go, and Ruby.
 ### **🛠 主な特徴 / Key Features**
 ✅ **Supports long sequences up to 2048 tokens** (compared to Microsoft's 512-token models)
 ✅ **Optimized for code search, code understanding, and code clone detection**
@@ -116,27 +120,25 @@ print(embedding.shape)
 | **Ruby**       | **0.8038**  | 0.7469  | **0.7568**  | 0.3318  | 0.5876  |
 | **Go**         | **0.9386**  | 0.9043  | 0.8117  | 0.3262  | 0.4243  |
-✅ **Achieves the highest accuracy in all target languages.**
-✅ **Significantly improved Java accuracy using additional fine-tuned GitHub data.**
-✅ **Outperforms previous models, especially in PHP and Go.**
 ---
-## **📊 Rust (独自データセット) / Rust Performance**
-| 指標 / Metric | **CodeModernBERT-Owl** |
-|--------------|----------------|
-| **MRR**      | 0.7940 |
-| **MAP**      | 0.7940 |
-| **R-Precision** | 0.7173 |
-### **📌 K別評価指標 / Evaluation Metrics by K**
-| K  | **Recall@K** | **Precision@K** | **NDCG@K** | **F1@K** | **Success Rate@K** | **Query Coverage@K** |
-|----|-------------|---------------|------------|--------|-----------------|-----------------|
-| **1**   | 0.7173  | 0.7173  | 0.7173  | 0.7173  | 0.7173  | 0.7173  |
-| **5**   | 0.8913  | 0.7852  | 0.8118  | 0.8132  | 0.8913  | 0.8913  |
-| **10**  | 0.9333  | 0.7908  | 0.8254  | 0.8230  | 0.9333  | 0.9333  |
-| **50**  | 0.9887  | 0.7938  | 0.8383  | 0.8288  | 0.9887  | 0.9887  |
-| **100** | 1.0000  | 0.7940  | 0.8401  | 0.8291  | 1.0000  | 1.0000  |
 ---
@@ -144,6 +146,7 @@ print(embedding.shape)
 ✅ **Top performance in all languages**
 ✅ **Rust support successfully added through dataset augmentation**
 ✅ **Further performance improvements possible with better datasets**
 ---
@@ -152,4 +155,6 @@ print(embedding.shape)
 ## **📧 連絡先 / Contact**
 📩 **For any questions, please contact:**
-📧 **shun0212114@outlook.jp**

 datasets:
 - Shuu12121/rust-codesearch-dataset-open
 - Shuu12121/java-codesearch-dataset-open
+- code-search-net/code_search_net
+- google/code_x_glue_ct_code_to_text
 language:
 - en
 pipeline_tag: sentence-similarity
 ---
+# **CodeModernBERT-Owl🦉**
 ## **概要 / Overview**
+### **CodeModernBERT-Owl🦉: 高精度なコード検索 & コード理解モデル**
 **CodeModernBERT-Owl** is a **pretrained model** designed from scratch for **code search and code understanding tasks**.
 Compared to previous versions such as **CodeHawks-ModernBERT** and **CodeMorph-ModernBERT**, this model **now supports Rust** and **improves search accuracy** in Python, PHP, Java, JavaScript, Go, and Ruby.
+---
 ### **🛠 主な特徴 / Key Features**
 ✅ **Supports long sequences up to 2048 tokens** (compared to Microsoft's 512-token models)
 ✅ **Optimized for code search, code understanding, and code clone detection**
 | **Ruby**       | **0.8038**  | 0.7469  | **0.7568**  | 0.3318  | 0.5876  |
 | **Go**         | **0.9386**  | 0.9043  | 0.8117  | 0.3262  | 0.4243  |
 ---
+## **🔁 別のおすすめモデル / Recommended Alternative Models**
+### 1. **CodeSearch-ModernBERT-Owl🦉** (https://huggingface.co/Shuu12121/CodeSearch-ModernBERT-Owl)
+If you need a model that is **more specialized for code search**, this model is highly recommended.
+コードサーチに**特化したモデルが必要な場合**はこちらがおすすめです。
+### 2. **CodeModernBERT-Snake🐍** (https://huggingface.co/Shuu12121/CodeModernBERT-Snake)
+If you need a pretrained model that supports **longer sequences or a smaller model size**, this model is ideal.
+**シーケンス長が長い**、または**モデルサイズが小さい**事前学習済みモデルが必要な場合はこちらをおすすめします。
+- **Maximum Sequence Length:** 8192 tokens
+- **Smaller Model Size:** ~75M parameters
+### 3. **CodeSearch-ModernBERT-Snake🐍** (https://huggingface.co/Shuu12121/CodeSearch-ModernBERT-Snake)
+For those looking for a model that combines **long sequence length and code search specialization**, this model is the best choice.
+**コードサーチに特化しつつ長いシーケンスを処理できるモデル**が欲しい場合にはこちらがおすすめです。
+- **Maximum Sequence Length:** 8192 tokens
+- **High Code Search Performance**
 ---
 ✅ **Top performance in all languages**
 ✅ **Rust support successfully added through dataset augmentation**
 ✅ **Further performance improvements possible with better datasets**
+✅ **Recommended for various code search and understanding tasks**
 ---
 ## **📧 連絡先 / Contact**
 📩 **For any questions, please contact:**
+📧 **shun0212114@outlook.jp**
+---
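
Reviewer note on the Rust table removed in this commit: it reported MRR, MAP, Recall@K, and Success Rate@K. The equal MRR and MAP values (0.7940) and the identical Recall@K and Success Rate@K columns are consistent with each query having exactly one relevant code snippet, although the diff does not state the evaluation setup. Under that assumption, the two metrics can be sketched as below. The function names and the sample ranks are hypothetical and are not taken from the model card or its evaluation script.

```python
from typing import Sequence


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Average of 1/rank, where ranks[i] is the 1-based position of the
    single relevant item for query i (assumption: one relevant item per query)."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def recall_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of queries whose relevant item appears in the top k results.
    With one relevant item per query this equals the success rate at k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical ranks for five queries (illustrative only, not model results).
ranks = [1, 1, 2, 1, 7]

mrr = mean_reciprocal_rank(ranks)
print(f"MRR: {mrr:.4f}")
for k in (1, 5, 10):
    print(f"Recall@{k}: {recall_at_k(ranks, k):.4f}")
```

With a single relevant item per query, average precision for a query reduces to 1/rank, so MAP and MRR coincide, which matches the removed table.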