Shuu12121/CodeSearch-ModernBERT-Crow-Plus 🐦‍⬛

ใ“ใฎใƒขใƒ‡ใƒซใฏใ€Shuu12121/CodeModernBERT-Crow ใ‚’ใƒ™ใƒผใ‚นใซใ—ใŸ Sentence Transformer ใƒขใƒ‡ใƒซใงใ‚ใ‚Šใ€็‰นใซๅคš่จ€่ชžใ‚ณใƒผใƒ‰ๆคœ็ดขใ‚ฟใ‚นใ‚ฏใซใŠใ„ใฆ้ซ˜ใ„ๆ€ง่ƒฝใ‚’็™บๆฎใ™ใ‚‹ใ‚ˆใ†ใซใƒ•ใ‚กใ‚คใƒณใƒใƒฅใƒผใƒ‹ใƒณใ‚ฐใ•ใ‚Œใฆใ„ใพใ™ใ€‚

This is a Sentence Transformer model based on Shuu12121/CodeModernBERT-Crow, fine-tuned for high performance on multilingual code search tasks.

Open in Colab 👉 Try it now on Google Colab
You can easily try out a function-level code search system for GitHub repositories built with this model!

📊 MTEB Leaderboard Results

CodeSearch-ModernBERT-Crow-Plus achieved high rankings on the following tasks in the Massive Text Embedding Benchmark (MTEB). Please check the leaderboard for the current standings.

| Task | nDCG@10 | Rank (as of April 2025) |
|---|---|---|
| CodeSearchNetRetrieval | 0.89296 | 8th of 146 models |
| COIRCodeSearchNetRetrieval | 0.79884 | 5th of 15 models |

ใ“ใ‚Œใ‚‰ใฎ็ตๆžœใฏใ€ๆœฌใƒขใƒ‡ใƒซใŒใ‚ณใƒผใƒ‰ๆคœ็ดขใ‚ฟใ‚นใ‚ฏใซใŠใ„ใฆ้žๅธธใซ็ซถไบ‰ๅŠ›ใฎใ‚ใ‚‹ๆ€ง่ƒฝใ‚’็™บๆฎใ—ใฆใ„ใ‚‹ใ“ใจใ‚’็คบใ—ใฆใ„ใพใ™ใ€‚ ็‰นใซใ€ๅคš่จ€่ชžใƒป่‡ช็„ถ่จ€่ชžโ€“ใ‚ณใƒผใƒ‰้–“ใฎๆคœ็ดข็ฒพๅบฆใซๅ„ชใ‚ŒใŸ Sentence Transformer ใƒขใƒ‡ใƒซใจใ—ใฆใ€ๅฎŸ็”จ็š„ใช้ธๆŠž่‚ขใฎไธ€ใคใงใ™ใ€‚

🧩 Integration with Related Projects

CodeSearch-ModernBERT-Crow-Plus ใฏใ€€CodeSearchCrow.ipynb ใฎใ‚ˆใ†ใซ ๅฎŸ้š›ใฎGitHubใƒชใƒใ‚ธใƒˆใƒชใ‚’ๅฏพ่ฑกใจใ—ใŸใ€้–ขๆ•ฐๅ˜ไฝใฎใ‚ณใƒผใƒ‰ๆคœ็ดขใ‚ทใ‚นใƒ†ใƒ ใ‚’็ฐกๅ˜ใซๆง‹็ฏ‰ใงใใพใ™ใ€‚

ใ“ใฎใƒŽใƒผใƒˆใƒ–ใƒƒใ‚ฏใงใฏไปฅไธ‹ใฎๅ‡ฆ็†ใŒๅฎŸ่กŒใ•ใ‚Œใพใ™๏ผš

  • GitHubใƒชใƒใ‚ธใƒˆใƒชใ‚’ๆŒ‡ๅฎšใ—ใฆใ‚ฏใƒญใƒผใƒณ
  • .py ใพใŸใฏ .ipynb ใƒ•ใ‚กใ‚คใƒซใ‹ใ‚‰้–ขๆ•ฐใƒปใ‚ณใƒผใƒ‰ใ‚ปใƒซใ‚’ๆŠฝๅ‡บ
  • ้–ขๆ•ฐใ‚ณใƒผใƒ‰ใ‚’ใ‚จใƒณใƒ™ใƒ‡ใ‚ฃใƒณใ‚ฐ๏ผˆSentence Transformerใƒขใƒ‡ใƒซใ‚’ไฝฟ็”จ๏ผ‰
  • FAISSใ‚คใƒณใƒ‡ใƒƒใ‚ฏใ‚นใ‚’ไฝœๆˆใ—ใฆ้ซ˜้€Ÿๆคœ็ดขใ‚’ๅฏ่ƒฝใซ
  • Qwen3-8B-FP8ใƒขใƒ‡ใƒซใซใ‚ˆใ‚‹ๆ—ฅๆœฌ่ชžโ†’่‹ฑ่ชž็ฟป่จณใ‚’้€šใ˜ใฆใ€ๆ—ฅๆœฌ่ชžใ‚ฏใ‚จใƒชใงใ‚‚่‡ช็„ถใชๆคœ็ดขใ‚’ๅฎŸ็พ
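The function-extraction step for .py files can be sketched with Python's ast module (a minimal illustration under assumed behavior, not the notebook's actual code; extract_functions is a hypothetical helper):

```python
import ast

def extract_functions(source: str) -> list[str]:
    # Collect the source text of every function definition in a Python file
    tree = ast.parse(source)
    return [
        ast.get_source_segment(source, node)
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]

src = "def add(a, b):\n    return a + b\n\ndef sub(a, b):\n    return a - b\n"
snippets = extract_functions(src)
print(len(snippets))  # 2 function-level snippets, ready for embedding
```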

🔹 Features

  • The index is built on the first run and can be reused for fast subsequent searches
  • Function-level granularity enables accurate retrieval of the most semantically similar code
  • Full support for Japanese queries (translated to English with Qwen3-8B-FP8 before searching)

🔗 Links

  • 📄 Runnable notebook: .ipynb

ๆฆ‚่ฆ / Overview

CodeSearch-ModernBERT-Crow-Plus ใฏใ€่‡ช็„ถ่จ€่ชžใฎใ‚ฏใ‚จใƒชใจ่ค‡ๆ•ฐใฎใƒ—ใƒญใ‚ฐใƒฉใƒŸใƒณใ‚ฐ่จ€่ชž๏ผˆPython, Java, JavaScript, PHP, Ruby, Go, Rust๏ผ‰ใฎใ‚ณใƒผใƒ‰ใ‚นใƒ‹ใƒšใƒƒใƒˆ๏ผˆไธปใซ้–ขๆ•ฐใƒฌใƒ™ใƒซ๏ผ‰้–“ใฎๆ„ๅ‘ณ็š„ใช้กžไผผๆ€งใ‚’ๆ‰ใˆใ‚‹ใŸใ‚ใซ่จญ่จˆใ•ใ‚ŒใŸ Sentence Transformer ใƒขใƒ‡ใƒซใงใ™ใ€‚ใƒ™ใƒผใ‚นใƒขใƒ‡ใƒซใงใ‚ใ‚‹ CodeModernBERT-Crow ใฎๅผทๅŠ›ใชใ‚ณใƒผใƒ‰็†่งฃ่ƒฝๅŠ›ใ‚’็ถ™ๆ‰ฟใ—ใ€ใ‚ณใƒผใƒ‰ๆคœ็ดขใ‚„้กžไผผๆ€งๅˆคๅฎšใ‚ฟใ‚นใ‚ฏใซๆœ€้ฉๅŒ–ใ•ใ‚Œใฆใ„ใพใ™ใ€‚

CodeSearch-ModernBERT-Crow-Plus is a Sentence Transformer model designed to capture the semantic similarity between natural language queries and code snippets (primarily at the function level) across multiple programming languages (Python, Java, JavaScript, PHP, Ruby, Go, Rust). It inherits the strong code understanding capabilities of its base model, CodeModernBERT-Crow, and is optimized for code search and similarity tasks.

Model Details

  • Base Model: Shuu12121/CodeModernBERT-Crow
    • Architecture: ModernBERT (hidden_size: 768, layers: 12, heads: 12)
    • Max Sequence Length: 1024 tokens
  • Fine-tuning: The model is believed to have been fine-tuned on a similarity-learning task using code paired with its documentation (e.g., the CodeSearchNet dataset). A Pooling layer has been added for use with the Sentence Transformers library.

How to Use

You can easily use this model with the sentence-transformers library.

from sentence_transformers import SentenceTransformer

# ใƒขใƒ‡ใƒซใฎใƒญใƒผใƒ‰ / Load the model
model = SentenceTransformer("Shuu12121/CodeSearch-ModernBERT-Crow-Plus")

# ใ‚จใƒณใ‚ณใƒผใƒ‰ใ—ใŸใ„ใƒ†ใ‚ญใ‚นใƒˆ๏ผˆใ‚ณใƒผใƒ‰ใพใŸใฏ่‡ช็„ถ่จ€่ชž๏ผ‰ / Texts to encode (code or natural language)
code_snippets = [
    "def factorial(n): if n == 0: return 1 else: return n * factorial(n-1)",
    "function binarySearch(arr, target) { let left = 0, right = arr.length - 1; while (left <= right) { const mid = Math.floor((left + right) / 2); if (arr[mid] === target) return mid; if (arr[mid] < target) left = mid + 1; else right = mid - 1; } return -1; }"
]

natural_language_queries = [
    "calculate the factorial of a number recursively",
    "find an element in a sorted array using binary search"
]

# ใ‚จใƒณใƒ™ใƒ‡ใ‚ฃใƒณใ‚ฐใฎๅ–ๅพ— / Get embeddings
code_embeddings = model.encode(code_snippets)
query_embeddings = model.encode(natural_language_queries)

print("Code Embeddings Shape:", code_embeddings.shape)
print("Query Embeddings Shape:", query_embeddings.shape)

# ้กžไผผๅบฆใฎ่จˆ็ฎ—๏ผˆไพ‹๏ผšใ‚ณใ‚ตใ‚คใƒณ้กžไผผๅบฆ๏ผ‰ / Calculate similarity (e.g., cosine similarity)
# Requires a similarity function, e.g., from sentence_transformers.util or sklearn.metrics.pairwise
# from sentence_transformers.util import cos_sim
# similarities = cos_sim(query_embeddings, code_embeddings)
# print(similarities)
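Once snippets are embedded, retrieval reduces to nearest-neighbor search over the embedding matrix. Below is a minimal brute-force cosine-similarity search (a stand-in for the FAISS index used in the related notebook; the function names and toy vectors are illustrative):

```python
import numpy as np

def build_index(embeddings: np.ndarray) -> np.ndarray:
    # L2-normalize so that inner product equals cosine similarity
    return embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

def search(index: np.ndarray, query_embedding: np.ndarray, k: int = 3):
    # Return the indices and scores of the k most similar snippets
    q = query_embedding / np.linalg.norm(query_embedding)
    scores = index @ q
    top = np.argsort(-scores)[:k]
    return top, scores[top]

# Toy 3-dimensional "embeddings" for four snippets
emb = np.array([[1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0],
                [0.9, 0.1, 0.0],
                [0.0, 0.0, 1.0]])
index = build_index(emb)
ids, scores = search(index, np.array([1.0, 0.05, 0.0]), k=2)
print(ids)  # indices of the two most similar snippets
```

In practice, replace the toy matrix with `model.encode(code_snippets)` and the query vector with `model.encode(query)`.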

Evaluation

ใ“ใฎใƒขใƒ‡ใƒซใฏ MTEB (Massive Text Embedding Benchmark) ใง่ฉ•ไพกใ•ใ‚Œใฆใ„ใพใ™ใ€‚

This model has been evaluated on the MTEB (Massive Text Embedding Benchmark).

Task: CodeSearchNet Retrieval

  • MTEB standard evaluation (main_score: nDCG@10): 0.89296

    • ndcg_at_1: 0.8135
    • ndcg_at_3: 0.8781
    • ndcg_at_5: 0.8868
    • ndcg_at_10: 0.8930
    • ndcg_at_20: 0.8947
    • ndcg_at_100: 0.8971
    • ndcg_at_1000: 0.8995
    • map_at_10: 0.8705
    • recall_at_10: 0.9610
    • mrr_at_10: 0.8705
  • COIR variant evaluation (main_score: nDCG@10): 0.79884

    • ndcg_at_1: 0.7152
    • ndcg_at_3: 0.7762
    • ndcg_at_5: 0.7885
    • ndcg_at_10: 0.7988
    • ndcg_at_20: 0.8056
    • ndcg_at_100: 0.8134
    • ndcg_at_1000: 0.8172
    • map_at_10: 0.7729
    • recall_at_10: 0.8794
    • mrr_at_10: 0.7729

Note: Scores differ for the same CodeSearchNet Retrieval task due to different evaluation settings.

ๅ‚่€ƒใจใ—ใฆใ€ใƒ™ใƒผใ‚นใƒขใƒ‡ใƒซ Shuu12121/CodeModernBERT-Crow ใฎ CodeSearchNet Test Split ใซใŠใ‘ใ‚‹ MRR@100 ใ‚นใ‚ณใ‚ขใฏไปฅไธ‹ใฎ้€šใ‚Šใงใ™๏ผˆๅ›บๅฎš่ฉ•ไพกใ‚นใ‚ฏใƒชใƒ—ใƒˆไฝฟ็”จ๏ผ‰ใ€‚

For reference, the MRR@100 scores for the base model Shuu12121/CodeModernBERT-Crow on the CodeSearchNet Test Split (using a fixed evaluation script) are:

| Language | Python | Java | JavaScript | PHP | Ruby | Go |
|---|---|---|---|---|---|---|
| MRR@100 | 0.9372 | 0.8642 | 0.8118 | 0.8388 | 0.8392 | 0.8522 |
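For context on the metric: MRR@K averages the reciprocal rank of the first relevant result within the top K, so a score of 0.9372 means the correct snippet is usually at or near rank 1. A minimal illustration:

```python
def mean_reciprocal_rank(first_relevant_ranks):
    # Ranks are 1-based; None means no relevant result was retrieved (contributes 0)
    return sum(1.0 / r for r in first_relevant_ranks if r is not None) / len(first_relevant_ranks)

# Three queries whose correct snippet appeared at ranks 1, 2, and 4
print(round(mean_reciprocal_rank([1, 2, 4]), 4))  # 0.5833
```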

ๆƒณๅฎšใ—ใฆใ„ใ‚‹็”จ้€”ใจๅˆถ้™ / Intended Use & Limitations

  • ๆƒณๅฎšใ—ใฆใ„ใ‚‹็”จ้€” / Intended Use:
    • ๅคš่จ€่ชžใ‚ณใƒผใƒ‰ๆคœ็ดข (Natural Language to Code, Code to Code)
    • ใ‚ณใƒผใƒ‰ใฎ้กžไผผๆ€งๅˆคๅฎš
    • ใ‚ณใƒผใƒ‰ๅˆ†้กžใ‚„ใ‚ฏใƒฉใ‚นใ‚ฟใƒชใƒณใ‚ฐใฎใŸใ‚ใฎ็‰นๅพดๆŠฝๅ‡บ
    • ใ‚ณใƒผใƒ‰ๆŽจ่–ฆใ‚ทใ‚นใƒ†ใƒ 
  • ๅฏพ่ฑก่จ€่ชž / Target Languages: Python, Java, JavaScript, PHP, Ruby, Go, Rust
  • ๅˆถ้™ / Limitations:
    • ไธปใซ้–ขๆ•ฐใƒฌใƒ™ใƒซใฎใ‚ณใƒผใƒ‰ใ‚นใƒ‹ใƒšใƒƒใƒˆใซๆœ€้ฉๅŒ–ใ•ใ‚Œใฆใ„ใพใ™ใ€‚้žๅธธใซ้•ทใ„ใ‚ณใƒผใƒ‰ใƒ•ใ‚กใ‚คใƒซๅ…จไฝ“ใ‚„ใ€ๆง‹ๆ–‡็š„ใซไธๅฎŒๅ…จใชใ‚ณใƒผใƒ‰ใซๅฏพใ™ใ‚‹ๆ€ง่ƒฝใฏไฝŽไธ‹ใ™ใ‚‹ๅฏ่ƒฝๆ€งใŒใ‚ใ‚Šใพใ™ใ€‚
    • ็‰นๅฎšใฎใƒ‰ใƒกใ‚คใƒณใ‚„ใƒฉใ‚คใƒ–ใƒฉใƒชใซ็‰นๅŒ–ใ—ใŸใ‚ฟใ‚นใ‚ฏใงใฏใ€่ฟฝๅŠ ใฎใƒ•ใ‚กใ‚คใƒณใƒใƒฅใƒผใƒ‹ใƒณใ‚ฐใŒๆœ‰ๅŠนใชๅ ดๅˆใŒใ‚ใ‚Šใพใ™ใ€‚
    • ็”Ÿๆˆใ‚ฟใ‚นใ‚ฏใซใฏ้ฉใ—ใฆใ„ใพใ›ใ‚“๏ผˆใ“ใ‚Œใฏใ‚จใƒณใ‚ณใƒผใƒ€ใƒขใƒ‡ใƒซใงใ™๏ผ‰ใ€‚

Note:
This model was evaluated on MTEB with commit hash 044a7a4b552f86e284817234c336bccf16f895ce.
The current README may have been updated since that version, but the model weights remain unchanged.

Contact

่ณชๅ•ใ‚„ๆๆกˆใซใคใ„ใฆใฏใ€้–‹็™บ่€… Shuu12121 ใพใงใ”้€ฃ็ตกใใ ใ•ใ„ใ€‚ For questions or suggestions, please contact the developer Shuu12121.

📧 shun0212114@outlook.jp
