| Voyage-Code-002 | Unknown | 68.5 | 56.3 |

We release the scripts to evaluate our model's performance [here](https://github.com/gangiswag/cornstack).

# Usage
**Important**: the query prompt *must* include the following *task instruction prefix*: "Represent this query for searching relevant code"

```python
code_embeddings = model.encode(codes)
print(code_embeddings)
```
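To make the prefix requirement above concrete, a small helper can prepend the task instruction before queries are passed to `model.encode`; this is a minimal sketch, where the helper name and the `": "` separator are our assumptions for illustration (only the prefix wording comes from this README), and code snippets are encoded without any prefix:

```python
# Task instruction prefix required for *queries* only; code is encoded as-is.
# NOTE: the ": " separator is an assumption for illustration.
QUERY_PREFIX = "Represent this query for searching relevant code: "

def format_query(query: str) -> str:
    """Prepend the required task instruction prefix to a raw search query."""
    return QUERY_PREFIX + query

queries = [format_query("calculate the n-th fibonacci number")]
print(queries[0])
# The formatted strings would then be passed to model.encode(queries).
```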
## Training
We use a bi-encoder architecture for `CodeRankEmbed`, with weights shared between the text and code encoder. The retriever is contrastively fine-tuned with InfoNCE loss on a high-quality dataset we curated called [CoRNStack](https://gangiswag.github.io/cornstack/). Our encoder is initialized with [Arctic-Embed-M-Long](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-long), a 137M parameter text encoder supporting an extended context length of 8,192 tokens.
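The InfoNCE objective with in-batch negatives can be sketched as follows. This is a minimal NumPy illustration, not the authors' training code; the temperature value, cosine-similarity scoring, and batch construction are assumptions. Each query's positive is the code at the same row index, and the other rows in the batch serve as negatives:

```python
import numpy as np

def info_nce_loss(query_emb, code_emb, temperature=0.07):
    """InfoNCE with in-batch negatives: row i of code_emb is the positive
    for row i of query_emb; all other rows act as negatives."""
    # L2-normalize so the dot product is cosine similarity.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = code_emb / np.linalg.norm(code_emb, axis=1, keepdims=True)
    logits = q @ c.T / temperature  # (batch, batch) similarity matrix
    # Cross-entropy against the diagonal (the matching query-code pairs).
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
aligned = info_nce_loss(q, q)                     # positives identical to queries
mismatched = info_nce_loss(q, rng.normal(size=(4, 8)))
print(aligned, mismatched)  # aligned pairs give a much lower loss
```

Minimizing this loss pulls each query embedding toward its paired code embedding while pushing it away from the other codes in the batch.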