Sarthak committed
Commit · 6742590 · 1 parent: 12d70ca

docs: rename model to codemalt and update evaluation instructions

This commit updates the README to reflect the new model name and simplifies the evaluation instructions.

README.md (CHANGED)
|
```diff
@@ -4,7 +4,7 @@ library_name: distiller
 license: apache-2.0
 license_name: apache-2.0
 license_link: LICENSE
-model_name: codemalt
+model_name: codemalt
 tags:
 - code-search
 - code-embeddings
@@ -24,9 +24,9 @@ language:
 pipeline_tag: feature-extraction
 ---
 
-# CodeMalt
+# CodeMalt
 
-**CodeMalt
+**CodeMalt** is a high-performance, code-specialized static embedding model created through Model2Vec distillation of `sentence-transformers/all-mpnet-base-v2`. This model achieves **73.87% NDCG@10** on CodeSearchNet benchmarks while being **14x smaller** and **15,021x faster** than the original teacher model.
 
 ## 🏆 Performance Highlights
 
@@ -130,7 +130,7 @@ results = distill.run_local_distillation(
 
 # Evaluate on CodeSearchNet
 evaluation_results = evaluate.run_evaluation(
-    models=["
+    models=["."],
     max_queries=1000,
     languages=["python", "javascript", "java", "go", "php", "ruby"]
 )
@@ -152,8 +152,6 @@ analyze.main(
 - General-purpose: `sentence-transformers/all-mpnet-base-v2`, `BAAI/bge-m3`
 - Instruction-tuned: `Alibaba-NLP/gte-Qwen2-1.5B-instruct`
 
-- **CodeMalt Model Series**: Our flagship models follow the naming convention `codemalt-base-[N]m` where `[N]m` indicates millions of parameters (e.g., `codemalt-base-8m` has ~7.6 million parameters)
-
 - **Advanced Training Pipeline**: Optional tokenlearn-based training following the POTION approach:
   1. Model2Vec distillation (basic static embeddings)
   2. Feature extraction using sentence transformers
```