Update README.md
## SentenceTransformer based on Shuu12121/CodeModernBERT-Owl🦉

This model is a [sentence-transformers](https://www.SBERT.net) model fine-tuned from [Shuu12121/CodeModernBERT-Owl](https://huggingface.co/Shuu12121/CodeModernBERT-Owl).
**It is specifically designed for code search and efficiently calculates semantic similarity between code snippets and documentation.**

---

このモデルは、[Shuu12121/CodeModernBERT-Owl](https://huggingface.co/Shuu12121/CodeModernBERT-Owl) をベースにファインチューニングされた [sentence-transformers](https://www.SBERT.net) モデルです。
**特にコードサーチに特化しており、コード片やドキュメントから効果的に意味的類似性を計算できる** ように設計されています。

---
### Model Evaluation / モデル評価

#### CoIR Evaluation Results / CoIRにおける評価結果

Despite being a relatively small model with around **150M parameters**, this model achieved an impressive **76.89** on the **CodeSearchNet** benchmark, demonstrating its high performance in code search tasks.
Since this model is specialized for code search, it does not support other tasks, and thus evaluation scores for other tasks are not provided.
In the CodeSearchNet task, this model outperforms many well-known models, as shown in the comparison table below.

このモデルは、**150M程度と比較的小さいモデル**ながら、**コードサーチタスクにおける評価指標である CodeSearchNet で 76.89** を達成しました。
他のタスクには対応していないため、評価値は提供されていません。
CodeSearchNetタスクにおける評価値としては、他の有名なモデルと比較しても高いパフォーマンスを示しています。
| Model Name                                    | CodeSearchNet Score  |
|-----------------------------------------------|----------------------|
| **Shuu12121/CodeModernBERT-Owl**              | **76.89**            |
| Salesforce/SFR-Embedding-Code-2B_R            | 73.5                 |
| CodeSage-large-v2                             | 94.26                |
| Salesforce/SFR-Embedding-Code-400M_R          | 72.53                |
| CodeSage-large                                | 90.58                |
| Voyage-Code-002                               | 81.79                |
| E5-Mistral                                    | 54.25                |
| E5-Base-v2                                    | 67.99                |
| OpenAI-Ada-002                                | 74.21                |
| BGE-Base-en-v1.5                              | 69.6                 |
| BGE-M3                                        | 43.23                |
| UniXcoder                                     | 60.2                 |
| GTE-Base-en-v1.5                              | 43.35                |
| Contriever                                    | 34.72                |
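For context on what these scores measure, retrieval benchmarks such as CoIR conventionally report NDCG@10. The sketch below implements the standard NDCG formula on a toy ranking; it is illustrative only, not this benchmark's own evaluation code:

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked items."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG of the actual ranking divided by DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Toy query: the single relevant snippet was retrieved at rank 2 of 3.
print(round(ndcg_at_k([0, 1, 0]), 4))  # 0.6309
```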
---

### Model Details / モデル詳細

- **Model Type / モデルタイプ:** Sentence Transformer
- **Base Model / ベースモデル:** [Shuu12121/CodeModernBERT-Owl](https://huggingface.co/Shuu12121/CodeModernBERT-Owl)
- **Maximum Sequence Length / 最大シーケンス長:** 2048 tokens
- **Output Dimensions / 出力次元:** 768 dimensions
- **Similarity Function / 類似度関数:** Cosine Similarity
- **License / ライセンス:** Apache-2.0
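The cosine similarity listed above can be reproduced directly from raw embedding vectors. A minimal offline sketch with random placeholder vectors standing in for real model output (the 768 dimension matches the output size listed above):

```python
import numpy as np

# Placeholder embeddings in place of real model output: 3 vectors of dim 768.
rng = np.random.default_rng(0)
emb = rng.standard_normal((3, 768))

# Cosine similarity = dot product of L2-normalized vectors.
normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
scores = normed @ normed.T

print(scores.shape)       # (3, 3)
print(scores.diagonal())  # each vector is maximally similar to itself (≈ 1.0)
```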
---
### Usage / 使用方法

#### Installation / インストール

To install Sentence Transformers, run the following command:
Sentence Transformers をインストールするには、以下のコマンドを実行します。

```bash
pip install -U sentence-transformers
```
#### Model Loading and Inference / モデルのロードと推論

```python
from sentence_transformers import SentenceTransformer

# Load the model / モデルをダウンロードしてロード
model = SentenceTransformer("Shuu12121/CodeSearch-ModernBERT-Owl")

# Example sentences for inference / 推論用の文リスト
sentences = [
    'Encrypts the zip file',
    'def freeze_encrypt(dest_dir, zip_filename, config, opt):\n    \n    pgp_keys = grok_keys(config)\n    icefile_prefix = "aomi-%s" % \\\n        os.path.basename(os.path.dirname(opt.secretfile))\n    if opt.icefile_prefix:\n        icefile_prefix = opt.icefile_prefix\n\n    timestamp = time.strftime("%H%M%S-%m-%d-%Y",\n                              datetime.datetime.now().timetuple())\n    ice_file = "%s/%s-%s.ice" % (dest_dir, icefile_prefix, timestamp)\n    if not encrypt(zip_filename, ice_file, pgp_keys):\n        raise aomi.exceptions.GPG("Unable to encrypt zipfile")\n\n    return ice_file',
    'def transform(self, sents):\n    \n\n    def convert(tokens):\n        return torch.tensor([self.vocab.stoi[t] for t in tokens], dtype=torch.long)\n\n    if self.vocab is None:\n        raise Exception(\n            "Must run .fit() for .fit_transform() before "\n            "calling .transform()."\n        )\n\n    seqs = sorted([convert(s) for s in sents], key=lambda x: -len(x))\n    X = torch.LongTensor(pad_sequence(seqs, batch_first=True))\n    return X',
]

# Generate embeddings / 埋め込みベクトルの生成
embeddings = model.encode(sentences)
print(embeddings.shape)  # Output: [3, 768]

# Calculate similarity scores / 類似度スコアの計算
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)  # Output: [3, 3]
```
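Ranking code candidates for a docstring query then reduces to sorting one row of that similarity matrix. A self-contained sketch using placeholder embeddings in place of `model.encode` output, so it runs without downloading the model:

```python
import numpy as np

# Placeholder embeddings standing in for model.encode(sentences):
# row 0 = the docstring query, rows 1-2 = the two code candidates.
rng = np.random.default_rng(42)
embeddings = rng.standard_normal((3, 768))

# Cosine similarity of the query against each candidate.
normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
query_scores = normed[0] @ normed[1:].T

# Candidates ordered from most to least similar to the query.
ranking = np.argsort(-query_scores)
for rank, idx in enumerate(ranking, start=1):
    print(f"rank {rank}: candidate {idx} (score {query_scores[idx]:.4f})")
```

With real embeddings, the candidate whose code actually implements the query's description would surface at rank 1.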
---

### Library Versions / ライブラリバージョン

- Python: 3.11.11
- Sentence Transformers: 3.4.1
---

### Citation / 引用情報

#### Sentence Transformers

```bibtex
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```