base_model:
  - Snowflake/snowflake-arctic-embed-m-long

CodeRankEmbed

CodeRankEmbed is a 137M-parameter bi-encoder for code retrieval that supports a context length of 8,192 tokens. It outperforms every open-source and proprietary code embedding model listed in the benchmarks below on both CSN and CoIR.

Check out our blog post and paper (to be released soon) for more details!

Performance Benchmarks

| Name | Parameters | CSN | CoIR |
| --- | --- | --- | --- |
| CodeRankEmbed | 137M | 77.9 | 60.1 |
| Arctic-Embed-M-Long | 137M | 53.4 | 43.0 |
| CodeSage-Small | 130M | 64.9 | 54.4 |
| CodeSage-Base | 356M | 68.7 | 57.5 |
| CodeSage-Large | 1.3B | 71.2 | 59.4 |
| Jina-Code-v2 | 161M | 67.2 | 58.4 |
| CodeT5+ | 110M | 74.2 | 45.9 |
| OpenAI-Ada-002 | 110M | 71.3 | 45.6 |
| Voyage-Code-002 | Unknown | 68.5 | 56.3 |

We release the scripts to evaluate our model's performance here.

Usage

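Important: every query must begin with the task instruction prefix "Represent this query for searching relevant code", followed by a colon and a space as in the example below. The requirement applies to queries only; the code snippets in the example are encoded without a prefix.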
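
To avoid forgetting the prefix, it can be added programmatically. The following is a minimal sketch: `QUERY_PREFIX` and `with_query_prefix` are hypothetical names, not part of the model or of sentence-transformers, and the `": "` separator is taken from the usage example in this README.

```python
# Instruction prefix quoted from the note above; the ": " separator
# mirrors the usage example and is otherwise an assumption.
QUERY_PREFIX = "Represent this query for searching relevant code: "


def with_query_prefix(queries):
    """Prepend the instruction prefix to each query that lacks it (hypothetical helper)."""
    return [q if q.startswith(QUERY_PREFIX) else QUERY_PREFIX + q for q in queries]


print(with_query_prefix(["Calculate the n-th Fibonacci number"]))
```

The returned list can be passed directly to `model.encode`. Code snippets should not go through this helper.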

from sentence_transformers import SentenceTransformer

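# trust_remote_code=True allows the custom model code shipped with the repository to run.
model = SentenceTransformer("cornstack/CodeRankEmbed", trust_remote_code=True)
# Queries carry the instruction prefix; code snippets do not.
queries = ['Represent this query for searching relevant code: Calculate the n-th Fibonacci number']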
codes = ["""def func(n):
    if n <= 0:
        return "Input should be a positive integer"
    elif n == 1:
        return 0
    elif n == 2:
        return 1
    else:
        a, b = 0, 1
        for _ in range(2, n):
            a, b = b, a + b
        return b
"""]
# Each call returns one embedding vector per input string.
query_embeddings = model.encode(queries)
print(query_embeddings)
code_embeddings = model.encode(codes)
print(code_embeddings)
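
The example above stops at printing the embeddings. For retrieval, the usual next step is to score each code snippet against the query and sort by score. The sketch below assumes cosine similarity as the scoring function, which this README does not state explicitly; `cosine_similarity` and `rank_codes` are hypothetical helpers, and the toy vectors stand in for real model outputs.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(float(a) * float(b) for a, b in zip(u, v))
    norm_u = math.sqrt(sum(float(a) ** 2 for a in u))
    norm_v = math.sqrt(sum(float(b) ** 2 for b in v))
    return dot / (norm_u * norm_v)


def rank_codes(query_embedding, code_embeddings):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [cosine_similarity(query_embedding, c) for c in code_embeddings]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors used in place of real embeddings.
toy_query = [1.0, 0.0, 0.0]
toy_codes = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
print([index for index, _ in rank_codes(toy_query, toy_codes)])  # [1, 2, 0]
```

With real embeddings, `rank_codes(query_embeddings[0], code_embeddings)` ranks every snippet in `codes` for the first query.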

Training

CodeRankEmbed uses a bi-encoder architecture in which the text and code encoders share weights. The retriever is fine-tuned contrastively with the InfoNCE loss on CoRNStack, a high-quality dataset of 21 million examples that we curated. The encoder is initialized from Arctic-Embed-M-Long, a 137M-parameter text encoder that supports an extended context length of 8,192 tokens.
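
For readers unfamiliar with InfoNCE, the following is a minimal sketch of the loss with in-batch negatives. It is an illustration, not the training code: `info_nce_loss` is a hypothetical function, the temperature of 0.05 is an assumed value, and the positive code for query `i` is assumed to sit at index `i` of each row.

```python
import math


def info_nce_loss(similarities, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    similarities[i][j] is the similarity of query i to code j. The positive
    for query i is assumed to be code i; the other codes in the row act as
    negatives. The default temperature is an assumption.
    """
    total = 0.0
    for i, row in enumerate(similarities):
        logits = [s / temperature for s in row]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
        # -log softmax of the positive entry
        total += log_sum_exp - logits[i]
    return total / len(similarities)


aligned = [[0.9, 0.1], [0.2, 0.8]]      # positives score highest
misaligned = [[0.1, 0.9], [0.8, 0.2]]   # negatives score highest
print(info_nce_loss(aligned), info_nce_loss(misaligned))
```

The loss is small when each query is most similar to its own code and grows when a negative outscores the positive, which is what pushes matching text and code embeddings together during fine-tuning.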