| | --- |
| | tags: |
| | - sentence-transformers |
| | - sentence-similarity |
| | - feature-extraction |
| | base_model: Shuu12121/CodeModernBERT-Owl |
| | pipeline_tag: sentence-similarity |
| | library_name: sentence-transformers |
| | metrics: |
| | - code_eval |
| | model-index: |
| | - name: SentenceTransformer based on Shuu12121/CodeModernBERT-Owl |
| | results: |
| | - task: |
| | type: semantic-similarity |
| | name: Semantic Similarity |
| | dataset: |
| | name: code docstring dev |
| | type: code-docstring-dev |
| | metrics: |
| | - type: pearson_cosine |
| | value: null |
| | name: Pearson Cosine |
| | - type: spearman_cosine |
| | value: null |
| | name: Spearman Cosine |
| | license: apache-2.0 |
| | datasets: |
| | - code-search-net/code_search_net |
| | - Shuu12121/java-codesearch-dataset-open |
| | - Shuu12121/rust-codesearch-dataset-open |
| | - google/code_x_glue_ct_code_to_text |
| | language: |
| | - en |
| | --- |
| | |
| |
|
| |
|
| |
|
| |
|
| | # SentenceTransformer based on Shuu12121/CodeModernBERT-Owl🦉 |
| |
|
| |
|
| |
|
| | This model is a **[sentence-transformers](https://www.SBERT.net)** model fine-tuned from **[Shuu12121/CodeModernBERT-Owl](https://huggingface.co/Shuu12121/CodeModernBERT-Owl)**, a **ModernBERT model for code that I pre-trained from scratch**. |
| | **It is designed for code search: it embeds code snippets and documentation so that their semantic similarity can be computed efficiently.** |
| | A key feature is its **maximum sequence length of 2048 tokens**, which lets it handle moderately long code snippets and documentation. |
| | Despite being relatively small, at about **150 million parameters**, it performs well on code search tasks. |
| |
|
| |
|
| | --- |
| |
|
| | ### Model Evaluation |
| |
|
| | #### CoIR Evaluation Results |
| |
|
| | Despite its relatively small size of around **150M parameters**, this model scores **76.89** on the **CodeSearchNet** task of the CoIR benchmark. |
| | Because the model is specialized for code search, it does not support other tasks, so no evaluation scores are provided for them. |
| | On CodeSearchNet it outperforms most of the models in the comparison table below, trailing only CodeSage-large-v2, CodeSage-large, and Voyage-Code-002. |
| |
|
| | | Model Name | CodeSearchNet Score | |
| | |-----------------------------------------------|----------------------| |
| | | **Shuu12121/CodeSearch-ModernBERT-Owl** | **76.89** | |
| | | Salesforce/SFR-Embedding-Code-2B_R | 73.5 | |
| | | CodeSage-large-v2 | 94.26 | |
| | | Salesforce/SFR-Embedding-Code-400M_R | 72.53 | |
| | | CodeSage-large | 90.58 | |
| | | Voyage-Code-002 | 81.79 | |
| | | E5-Mistral | 54.25 | |
| | | E5-Base-v2 | 67.99 | |
| | | OpenAI-Ada-002 | 74.21 | |
| | | BGE-Base-en-v1.5 | 69.6 | |
| | | BGE-M3 | 43.23 | |
| | | UniXcoder | 60.2 | |
| | | GTE-Base-en-v1.5 | 43.35 | |
| | | Contriever | 34.72 | |
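To make the comparison concrete, the short sketch below sorts the scores from the table above and reports where this model lands. The `scores` dictionary simply copies the table values; the helper name `rank_of` and the label `"This model"` are illustrative choices, not part of any library.

```python
# Scores copied from the comparison table above.
scores = {
    "This model": 76.89,
    "Salesforce/SFR-Embedding-Code-2B_R": 73.5,
    "CodeSage-large-v2": 94.26,
    "Salesforce/SFR-Embedding-Code-400M_R": 72.53,
    "CodeSage-large": 90.58,
    "Voyage-Code-002": 81.79,
    "E5-Mistral": 54.25,
    "E5-Base-v2": 67.99,
    "OpenAI-Ada-002": 74.21,
    "BGE-Base-en-v1.5": 69.6,
    "BGE-M3": 43.23,
    "UniXcoder": 60.2,
    "GTE-Base-en-v1.5": 43.35,
    "Contriever": 34.72,
}


def rank_of(name, table):
    """Return the 1-based rank of `name` when sorting scores from high to low."""
    ordered = sorted(table, key=table.get, reverse=True)
    return ordered.index(name) + 1


# Print the full ranking, best score first.
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for position, (name, score) in enumerate(ranking, start=1):
    print(f"{position:2d}. {name:<40s} {score:6.2f}")

print("Rank of this model:", rank_of("This model", scores), "of", len(scores))
```

With the values in this table, the model ranks 4th of 14 entries.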
| |
|
| | --- |
| |
|
| | ### Model Details |
| |
|
| | - **Model Type:** Sentence Transformer |
| | - **Base Model:** [Shuu12121/CodeModernBERT-Owl](https://huggingface.co/Shuu12121/CodeModernBERT-Owl) |
| | - **Maximum Sequence Length:** 2048 tokens |
| | - **Output Dimensions:** 768 |
| | - **Similarity Function:** Cosine Similarity |
| | - **License:** Apache-2.0 |
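Since the similarity function is cosine similarity over 768-dimensional embeddings, the following dependency-free sketch shows the computation on plain Python lists. The function name `cosine_similarity` and the toy 3-dimensional vectors are illustrative; with real embeddings, `model.similarity` performs this computation for you.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for 768-dimensional embeddings.
query = [1.0, 2.0, 0.0]
same_direction = [2.0, 4.0, 0.0]   # a scaled copy of `query`
orthogonal = [0.0, 0.0, 5.0]       # shares no component with `query`

print(cosine_similarity(query, same_direction))
print(cosine_similarity(query, orthogonal))
```

Vectors pointing the same way score close to 1 regardless of their length, and orthogonal vectors score 0, which is why cosine similarity is a common choice for comparing embeddings.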
| |
|
| | --- |
| |
|
| | ### Usage |
| |
|
| | #### Installation |
| |
|
| | To install Sentence Transformers, run the following command: |
| |
|
| | ```bash |
| | pip install -U sentence-transformers |
| | ``` |
| |
|
| | #### Model Loading and Inference |
| |
|
| | ```python |
| | from sentence_transformers import SentenceTransformer |
| | |
| | # Load the model (downloaded from the Hugging Face Hub on first use) |
| | model = SentenceTransformer("Shuu12121/CodeSearch-ModernBERT-Owl") |
| | |
| | # Example inputs: one docstring-style query and two code snippets |
| | sentences = [ |
| | 'Encrypts the zip file', |
| | 'def freeze_encrypt(dest_dir, zip_filename, config, opt):\n \n pgp_keys = grok_keys(config)\n icefile_prefix = "aomi-%s" % \\\n os.path.basename(os.path.dirname(opt.secretfile))\n if opt.icefile_prefix:\n icefile_prefix = opt.icefile_prefix\n\n timestamp = time.strftime("%H%M%S-%m-%d-%Y",\n datetime.datetime.now().timetuple())\n ice_file = "%s/%s-%s.ice" % (dest_dir, icefile_prefix, timestamp)\n if not encrypt(zip_filename, ice_file, pgp_keys):\n raise aomi.exceptions.GPG("Unable to encrypt zipfile")\n\n return ice_file', |
| | 'def transform(self, sents):\n \n\n def convert(tokens):\n return torch.tensor([self.vocab.stoi[t] for t in tokens], dtype=torch.long)\n\n if self.vocab is None:\n raise Exception(\n "Must run .fit() for .fit_transform() before " "calling .transform()."\n )\n\n seqs = sorted([convert(s) for s in sents], key=lambda x: -len(x))\n X = torch.LongTensor(pad_sequence(seqs, batch_first=True))\n return X', |
| | ] |
| | |
| | # Generate embeddings (a NumPy array by default) |
| | embeddings = model.encode(sentences) |
| | print(embeddings.shape) # (3, 768) |
| | |
| | # Calculate pairwise cosine similarity scores |
| | similarities = model.similarity(embeddings, embeddings) |
| | print(similarities.shape) # torch.Size([3, 3]) |
| | ``` |
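For code search specifically, a common pattern is to embed one natural-language query and a list of candidate snippets, then rank the snippets by similarity. The sketch below is one way to do that, reusing `model.encode` and `model.similarity` from the example above. The helper names `top_k` and `search` are hypothetical and not part of the library, and `search` is only defined here, not called, so the block runs without downloading the model.

```python
def top_k(scores, k=3):
    """Return (index, score) pairs for the k highest scores, best first."""
    indexed = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return indexed[:k]


def search(model, query, snippets, k=3):
    """Rank `snippets` against `query` with a loaded SentenceTransformer.

    `model` is assumed to be the object returned by
    SentenceTransformer("Shuu12121/CodeSearch-ModernBERT-Owl").
    """
    query_embedding = model.encode([query])       # one row per query
    snippet_embeddings = model.encode(snippets)   # one row per snippet
    # model.similarity returns a (1, len(snippets)) matrix of scores;
    # take the single row and convert it to a plain list of floats.
    scores = model.similarity(query_embedding, snippet_embeddings)[0].tolist()
    return [(snippets[i], score) for i, score in top_k(scores, k)]


# The ranking step on its own, with made-up scores for illustration.
print(top_k([0.12, 0.87, 0.45], k=2))
```

With the model loaded as above, `search(model, "Encrypts the zip file", sentences[1:], k=1)` would return the best-matching snippet together with its score.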
| |
|
| | --- |
| |
|
| | ### Library Versions |
| |
|
| | - Python: 3.11.11 |
| | - Sentence Transformers: 3.4.1 |
| | - Transformers: 4.50.0 |
| | - PyTorch: 2.6.0+cu124 |
| | - Accelerate: 1.5.2 |
| | - Datasets: 3.4.1 |
| | - Tokenizers: 0.21.1 |
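To compare a local environment against the versions listed above, a small check with the standard library's `importlib.metadata` can help. The package names below are assumed to be the usual PyPI distribution names, and the helper `installed_version` is illustrative. A mismatch does not necessarily mean the model will fail to load.

```python
from importlib import metadata

# Versions listed above, keyed by (assumed) PyPI distribution name.
EXPECTED = {
    "sentence-transformers": "3.4.1",
    "transformers": "4.50.0",
    "torch": "2.6.0+cu124",
    "accelerate": "1.5.2",
    "datasets": "3.4.1",
    "tokenizers": "0.21.1",
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report each package as matching or differing from the listed version.
for package, expected in EXPECTED.items():
    found = installed_version(package)
    status = "ok" if found == expected else "differs"
    print(f"{package:<22s} expected {expected:<12s} found {found} ({status})")
```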
| |
|
| | --- |
| |
|
| | ### Citation |
| |
|
| | #### Sentence Transformers |
| | ```bibtex |
| | @inproceedings{reimers-2019-sentence-bert, |
| | title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks", |
| | author = "Reimers, Nils and Gurevych, Iryna", |
| | booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing", |
| | month = "11", |
| | year = "2019", |
| | publisher = "Association for Computational Linguistics", |
| | url = "https://arxiv.org/abs/1908.10084", |
| | } |
| | ``` |
| |
|
| | #### MultipleNegativesRankingLoss |
| | ```bibtex |
| | @misc{henderson2017efficient, |
| | title={Efficient Natural Language Response Suggestion for Smart Reply}, |
| | author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil}, |
| | year={2017}, |
| | eprint={1705.00652}, |
| | archivePrefix={arXiv}, |
| | primaryClass={cs.CL} |
| | } |
| | ``` |
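The MultipleNegativesRankingLoss citation suggests the model was fine-tuned with in-batch negatives, though the training setup is not described in this card. Assuming that, the loss can be sketched in plain Python as cross-entropy over a query-by-document similarity matrix in which each matching pair sits on the diagonal. The `scale` value of 20 mirrors the default in the sentence-transformers `MultipleNegativesRankingLoss` class; the function name and the toy matrices are illustrative.

```python
import math


def multiple_negatives_ranking_loss(similarities, scale=20.0):
    """Mean cross-entropy where row i's correct column is i.

    `similarities[i][j]` is the cosine similarity between query i and
    document j; every off-diagonal document acts as a negative.
    """
    total = 0.0
    for i, row in enumerate(similarities):
        logits = [scale * s for s in row]
        # Log-sum-exp with the max subtracted for numerical stability.
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_norm - logits[i]
    return total / len(similarities)


# Well-separated pairs give a loss near zero ...
print(multiple_negatives_ranking_loss([[0.9, 0.1], [0.2, 0.8]]))
# ... while indistinguishable pairs give log(batch size).
print(multiple_negatives_ranking_loss([[0.5, 0.5], [0.5, 0.5]]))
```

Minimizing this loss pushes each docstring's embedding toward its own code snippet and away from the other snippets in the batch, which is the behavior a code search model needs.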