| Voyage-Code-002 | Unknown | 68.5 | 56.3 |

We release the scripts to evaluate our model's performance [here](https://github.com/gangiswag/cornstack).

# Usage
**Important**: the query prompt *must* include the following *task instruction prefix*: "Represent this query for searching relevant code"

```python
code_embeddings = model.encode(codes)
print(code_embeddings)
```
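To make the prefix requirement above concrete, a small helper can prepend the task instruction before queries are passed to `model.encode`; this is a minimal sketch, where the helper name and the `": "` separator are our assumptions for illustration (only the prefix wording comes from this README), and code snippets are encoded without any prefix:

```python
# Task instruction prefix required for *queries* only; code is encoded as-is.
# NOTE: the ": " separator is an assumption for illustration.
QUERY_PREFIX = "Represent this query for searching relevant code: "

def format_query(query: str) -> str:
    """Prepend the required task instruction prefix to a raw search query."""
    return QUERY_PREFIX + query

queries = [format_query("calculate the n-th fibonacci number")]
print(queries[0])
# The formatted strings would then be passed to model.encode(queries).
```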
## Training
We use a bi-encoder architecture for `CodeRankEmbed`, with weights shared between the text and code encoder. The retriever is contrastively fine-tuned with InfoNCE loss on a high-quality dataset we curated called [CoRNStack](https://gangiswag.github.io/cornstack/). Our encoder is initialized with [Arctic-Embed-M-Long](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-long), a 137M parameter text encoder supporting an extended context length of 8,192 tokens.
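The InfoNCE objective with in-batch negatives can be sketched as follows. This is a minimal NumPy illustration, not the authors' training code; the temperature value, cosine-similarity scoring, and batch construction are assumptions. Each query's positive is the code at the same row index, and the other rows in the batch serve as negatives:

```python
import numpy as np

def info_nce_loss(query_emb, code_emb, temperature=0.07):
    """InfoNCE with in-batch negatives: row i of code_emb is the positive
    for row i of query_emb; all other rows act as negatives."""
    # L2-normalize so the dot product is cosine similarity.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = code_emb / np.linalg.norm(code_emb, axis=1, keepdims=True)
    logits = q @ c.T / temperature  # (batch, batch) similarity matrix
    # Cross-entropy against the diagonal (the matching query-code pairs).
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
aligned = info_nce_loss(q, q)                     # positives identical to queries
mismatched = info_nce_loss(q, rng.normal(size=(4, 8)))
print(aligned, mismatched)  # aligned pairs give a much lower loss
```

Minimizing this loss pulls each query embedding toward its paired code embedding while pushing it away from the other codes in the batch.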