# Code-Specialized Model2Vec Distillation Analysis
## Executive Summary
This report presents a comprehensive analysis of Model2Vec distillation experiments using different teacher models for code-specialized embedding generation.
### Evaluated Models Overview
**Simplified Distillation Models:** 14
**Peer Comparison Models:** 19
**Total Models Analyzed:** 33
### Best Performing Simplified Model: code_model2vec_all_mpnet_base_v2
**Overall CodeSearchNet Performance:**
- **NDCG@10**: 0.7387
- **Mean Reciprocal Rank (MRR)**: 0.7010
- **Recall@5**: 0.8017
- **Mean Rank**: 6.4
## Comprehensive Model Comparison
### All Simplified Distillation Models Performance
| Model | Teacher | NDCG@10 | MRR | Recall@5 | Status |
|-------|---------|---------|-----|----------|--------|
| code_model2vec_all_mpnet_base_v2 | [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) | 0.7387 | 0.7010 | 0.8017 | Best |
| code_model2vec_all_MiniLM_L6_v2 | [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) | 0.7385 | 0.7049 | 0.7910 | 2nd |
| code_model2vec_jina_embeddings_v2_base_code | [jinaai/jina-embeddings-v2-base-code](https://huggingface.co/jinaai/jina-embeddings-v2-base-code) | 0.7381 | 0.6996 | 0.8130 | 3rd |
| code_model2vec_paraphrase_MiniLM_L6_v2 | [sentence-transformers/paraphrase-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2) | 0.7013 | 0.6638 | 0.7665 | #4 |
| code_model2vec_Reason_ModernColBERT | [lightonai/Reason-ModernColBERT](https://huggingface.co/lightonai/Reason-ModernColBERT) | 0.6598 | 0.6228 | 0.7260 | #5 |
| code_model2vec_all_mpnet_base_v2_fine_tuned | [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) | 0.6147 | 0.5720 | 0.6950 | #6 |
| code_model2vec_bge_m3 | [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) | 0.4863 | 0.4439 | 0.5514 | #7 |
| code_model2vec_jina_embeddings_v3 | [jinaai/jina-embeddings-v3](https://huggingface.co/jinaai/jina-embeddings-v3) | 0.4755 | 0.4416 | 0.5456 | #8 |
| code_model2vec_nomic_embed_text_v2_moe | [nomic-ai/nomic-embed-text-v2-moe](https://huggingface.co/nomic-ai/nomic-embed-text-v2-moe) | 0.4532 | 0.4275 | 0.5094 | #9 |
| code_model2vec_gte_Qwen2_1.5B_instruct | [Alibaba-NLP/gte-Qwen2-1.5B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct) | 0.4238 | 0.3879 | 0.4719 | #10 |
| code_model2vec_Qodo_Embed_1_1.5B | [Qodo/Qodo-Embed-1-1.5B](https://huggingface.co/Qodo/Qodo-Embed-1-1.5B) | 0.4101 | 0.3810 | 0.4532 | #11 |
| code_model2vec_graphcodebert_base | [microsoft/graphcodebert-base](https://huggingface.co/microsoft/graphcodebert-base) | 0.3420 | 0.3140 | 0.3704 | #12 |
| code_model2vec_Linq_Embed_Mistral | [Linq-AI-Research/Linq-Embed-Mistral](https://huggingface.co/Linq-AI-Research/Linq-Embed-Mistral) | 0.2868 | 0.2581 | 0.3412 | #13 |
| code_model2vec_codebert_base | [microsoft/codebert-base](https://huggingface.co/microsoft/codebert-base) | 0.2779 | 0.2534 | 0.3136 | #14 |
### Model Specifications Analysis
Our distilled models share a common output architecture (256-dimensional static embeddings) while inheriting vocabulary size from their teacher models:
| Model | Vocabulary Size | Parameters | Embedding Dim | Disk Size |
|-------|----------------|------------|---------------|-----------|
| all_mpnet_base_v2 | 29,528 | 7.6M | 256 | 14.4MB |
| all_MiniLM_L6_v2 | 29,525 | 7.6M | 256 | 14.4MB |
| jina_embeddings_v2_base_code | 61,053 | 15.6M | 256 | 29.8MB |
| paraphrase_MiniLM_L6_v2 | 29,525 | 7.6M | 256 | 14.4MB |
| Reason_ModernColBERT | 50,254 | 12.9M | 256 | 24.5MB |
| all_mpnet_base_v2_fine_tuned | 36,624 | 9.4M | 256 | 35.8MB |
| bge_m3 | 249,999 | 64.0M | 256 | 122.1MB |
| jina_embeddings_v3 | 249,999 | 64.0M | 256 | 122.1MB |
| nomic_embed_text_v2_moe | 249,999 | 64.0M | 256 | 122.1MB |
| gte_Qwen2_1.5B_instruct | 151,644 | 38.8M | 256 | 74.0MB |
| Qodo_Embed_1_1.5B | 151,644 | 38.8M | 256 | 74.0MB |
| graphcodebert_base | 50,262 | 12.9M | 256 | 24.5MB |
| Linq_Embed_Mistral | 31,999 | 8.2M | 256 | 15.6MB |
| codebert_base | 50,262 | 12.9M | 256 | 24.5MB |

*Comprehensive analysis of our distilled models showing vocabulary size, parameter count, embedding dimensions, and storage requirements.*
#### Key Insights from Model Specifications:
- **Vocabulary Range**: Vocabulary sizes vary widely with the teacher's tokenizer, from 29,525 to 249,999 tokens (avg: 101,594)
- **Parameter Efficiency**: Models range from 7.6M to 64.0M parameters (avg: 26.0M)
- **Storage Efficiency**: Disk usage ranges from 14.4MB to 122.1MB (avg: 50.9MB)
- **Embedding Dimensions**: Consistent 256 dimensions across all models (set by the PCA step of the distillation)
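The parameter and storage figures above follow directly from vocabulary size and embedding dimension: a static-embedding model stores one 256-dimensional vector per token, and the reported disk sizes are consistent with float16 (2 bytes per value) storage. A quick sanity check (the function name is mine, for illustration):

```python
def model2vec_footprint(vocab_size: int, embedding_dim: int = 256) -> tuple[float, float]:
    """Estimate parameter count (millions) and disk size (MiB) for a
    static-embedding model, assuming float16 (2 bytes per value) storage."""
    params = vocab_size * embedding_dim
    return params / 1e6, params * 2 / 2**20

# all_mpnet_base_v2 student: 29,528 tokens x 256 dims
params_m, disk_mib = model2vec_footprint(29528)
print(f"{params_m:.1f}M params, {disk_mib:.1f}MB on disk")  # -> "7.6M params, 14.4MB on disk"
```

The same arithmetic reproduces every row except the fine-tuned variant, whose 35.8MB matches float32 (4 bytes per value) storage instead.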
### Key Findings
- **Best Distilled Model**: code_model2vec_all_mpnet_base_v2 (NDCG@10: 0.7387)
- **Least Effective Distilled Model**: code_model2vec_codebert_base (NDCG@10: 0.2779)
- **Performance Range**: 62.4% relative difference in NDCG@10 between best and worst
- **Average Performance**: 0.5248 NDCG@10
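The 62.4% figure is the relative spread in NDCG@10 between the best and worst distilled models:

```python
best, worst = 0.7387, 0.2779  # NDCG@10 values from the table above
spread = (best - worst) / best
print(f"{spread:.1%}")  # -> 62.4%
```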
## Language Performance Radar Charts
### Best Model vs Peer Models Comparison
![]()
*Comparative view showing how the best simplified distillation model performs against top peer models across programming languages.*
### Individual Model Performance by Language
#### code_model2vec_all_mpnet_base_v2 (Teacher: [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)) - NDCG@10: 0.7387
![]()
#### code_model2vec_all_MiniLM_L6_v2 (Teacher: [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)) - NDCG@10: 0.7385
![]()
#### code_model2vec_jina_embeddings_v2_base_code (Teacher: [jinaai/jina-embeddings-v2-base-code](https://huggingface.co/jinaai/jina-embeddings-v2-base-code)) - NDCG@10: 0.7381
![]()
#### code_model2vec_paraphrase_MiniLM_L6_v2 (Teacher: [sentence-transformers/paraphrase-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2)) - NDCG@10: 0.7013
![]()
#### code_model2vec_Reason_ModernColBERT (Teacher: [lightonai/Reason-ModernColBERT](https://huggingface.co/lightonai/Reason-ModernColBERT)) - NDCG@10: 0.6598
![]()
#### code_model2vec_all_mpnet_base_v2_fine_tuned (Teacher: [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)) - NDCG@10: 0.6147
![]()
#### code_model2vec_bge_m3 (Teacher: [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3)) - NDCG@10: 0.4863
![]()
#### code_model2vec_jina_embeddings_v3 (Teacher: [jinaai/jina-embeddings-v3](https://huggingface.co/jinaai/jina-embeddings-v3)) - NDCG@10: 0.4755
![]()
#### code_model2vec_nomic_embed_text_v2_moe (Teacher: [nomic-ai/nomic-embed-text-v2-moe](https://huggingface.co/nomic-ai/nomic-embed-text-v2-moe)) - NDCG@10: 0.4532
![]()
#### code_model2vec_gte_Qwen2_1.5B_instruct (Teacher: [Alibaba-NLP/gte-Qwen2-1.5B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct)) - NDCG@10: 0.4238
![]()
#### code_model2vec_Qodo_Embed_1_1.5B (Teacher: [Qodo/Qodo-Embed-1-1.5B](https://huggingface.co/Qodo/Qodo-Embed-1-1.5B)) - NDCG@10: 0.4101
![]()
#### code_model2vec_graphcodebert_base (Teacher: [microsoft/graphcodebert-base](https://huggingface.co/microsoft/graphcodebert-base)) - NDCG@10: 0.3420
![]()
#### code_model2vec_Linq_Embed_Mistral (Teacher: [Linq-AI-Research/Linq-Embed-Mistral](https://huggingface.co/Linq-AI-Research/Linq-Embed-Mistral)) - NDCG@10: 0.2868
![]()
#### code_model2vec_codebert_base (Teacher: [microsoft/codebert-base](https://huggingface.co/microsoft/codebert-base)) - NDCG@10: 0.2779
![]()
## Peer Model Comparison
![]()
*Comparison with established code-specialized embedding models using actual evaluation results.*
### Complete Model Ranking
| Rank | Model | Type | NDCG@10 | MRR | Recall@5 |
|------|-------|------|---------|-----|----------|
| 1 | Alibaba-NLP/gte-Qwen2-1.5B-instruct | General | 0.9729 | 0.9676 | 0.9825 |
| 2 | Qodo/Qodo-Embed-1-1.5B | General | 0.9715 | 0.9659 | 0.9875 |
| 3 | jinaai/jina-embeddings-v2-base-code | General | 0.9677 | 0.9618 | 0.9849 |
| 4 | jinaai/jina-embeddings-v3 | General | 0.9640 | 0.9573 | 0.9839 |
| 5 | sentence-transformers/all-mpnet-base-v2 | General | 0.9477 | 0.9358 | 0.9732 |
| 6 | nomic-ai/nomic-embed-text-v2-moe | General | 0.9448 | 0.9357 | 0.9659 |
| 7 | sentence-transformers/all-MiniLM-L12-v2 | General | 0.9398 | 0.9265 | 0.9732 |
| 8 | BAAI/bge-m3 | General | 0.9383 | 0.9295 | 0.9643 |
| 9 | sentence-transformers/all-MiniLM-L6-v2 | General | 0.9255 | 0.9099 | 0.9642 |
| 10 | lightonai/Reason-ModernColBERT | General | 0.9188 | 0.9036 | 0.9486 |
| 11 | Linq-AI-Research/Linq-Embed-Mistral | General | 0.9080 | 0.8845 | 0.9650 |
| 12 | sentence-transformers/paraphrase-MiniLM-L6-v2 | General | 0.8297 | 0.8016 | 0.8828 |
| 13 | minishlab/potion-base-8M | Model2Vec | 0.8162 | 0.7817 | 0.8931 |
| 14 | minishlab/potion-retrieval-32M | Model2Vec | 0.8137 | 0.7810 | 0.8792 |
| 15 | code_model2vec_all_mpnet_base_v2 | **Simplified Distillation** | 0.7387 | 0.7010 | 0.8017 |
| 16 | code_model2vec_all_MiniLM_L6_v2 | **Simplified Distillation** | 0.7385 | 0.7049 | 0.7910 |
| 17 | code_model2vec_jina_embeddings_v2_base_code | **Simplified Distillation** | 0.7381 | 0.6996 | 0.8130 |
| 18 | code_model2vec_paraphrase_MiniLM_L6_v2 | **Simplified Distillation** | 0.7013 | 0.6638 | 0.7665 |
| 19 | code_model2vec_Reason_ModernColBERT | **Simplified Distillation** | 0.6598 | 0.6228 | 0.7260 |
| 20 | code_model2vec_all_mpnet_base_v2_fine_tuned | **Fine-tuned Distillation** | 0.6147 | 0.5720 | 0.6950 |
| 21 | potion-multilingual-128M | Model2Vec | 0.6124 | 0.5683 | 0.7017 |
| 22 | huggingface/CodeBERTa-small-v1 | Code-Specific | 0.5903 | 0.5350 | 0.6779 |
| 23 | Salesforce/codet5-base | Code-Specific | 0.4872 | 0.4500 | 0.5742 |
| 24 | code_model2vec_bge_m3 | **Simplified Distillation** | 0.4863 | 0.4439 | 0.5514 |
| 25 | code_model2vec_jina_embeddings_v3 | **Simplified Distillation** | 0.4755 | 0.4416 | 0.5456 |
| 26 | code_model2vec_nomic_embed_text_v2_moe | **Simplified Distillation** | 0.4532 | 0.4275 | 0.5094 |
| 27 | code_model2vec_gte_Qwen2_1.5B_instruct | **Simplified Distillation** | 0.4238 | 0.3879 | 0.4719 |
| 28 | code_model2vec_Qodo_Embed_1_1.5B | **Simplified Distillation** | 0.4101 | 0.3810 | 0.4532 |
| 29 | microsoft/graphcodebert-base | Code-Specific | 0.4039 | 0.3677 | 0.4650 |
| 30 | code_model2vec_graphcodebert_base | **Simplified Distillation** | 0.3420 | 0.3140 | 0.3704 |
| 31 | code_model2vec_Linq_Embed_Mistral | **Simplified Distillation** | 0.2868 | 0.2581 | 0.3412 |
| 32 | code_model2vec_codebert_base | **Simplified Distillation** | 0.2779 | 0.2534 | 0.3136 |
| 33 | microsoft/codebert-base | Code-Specific | 0.1051 | 0.1058 | 0.1105 |
## Performance Analysis
### Multi-Model Comparison Charts
![]()
*Comprehensive comparison across all evaluation metrics.*
### Language Performance Analysis
![]()
*Performance heatmap showing how different models perform across programming languages.*
### Efficiency Analysis
![]()
*Performance vs model size analysis showing the efficiency benefits of distillation.*
## Operational Performance Analysis
![]()
*Comprehensive performance benchmarking across multiple operational metrics.*
### Performance Scaling Analysis
![]()
*How performance scales with different batch sizes for optimal throughput.*
![]()
*Memory usage patterns across different batch sizes.*
## Language-Specific Analysis
### Performance by Programming Language
| Language | Best Model Performance | Average Performance | Language Difficulty |
|----------|------------------------|---------------------|---------------------|
| Go | 0.9780 | 0.6960 | Easy |
| Java | 0.9921 | 0.6553 | Easy |
| JavaScript | 0.9550 | 0.5850 | Easy |
| PHP | 1.0000 | 0.6321 | Easy |
| Python | 1.0000 | 0.8623 | Easy |
| Ruby | 0.9493 | 0.6397 | Easy |
## Conclusions and Recommendations
### Teacher Model Analysis
Based on the evaluation results across all simplified distillation models:
1. **Best Teacher Model**: sentence-transformers/all-mpnet-base-v2 (NDCG@10: 0.7387)
2. **Least Effective Teacher**: microsoft/codebert-base (NDCG@10: 0.2779)
3. **Teacher Model Impact**: The choice of teacher model accounts for a 62.4% relative spread in NDCG@10
### Recommendations
- **For Production**: Use sentence-transformers/all-mpnet-base-v2 as the teacher model for best performance
- **For Efficiency**: Model2Vec distillation provides significant size reduction with competitive performance
- **For Code Tasks**: General-purpose teachers (e.g., all-mpnet-base-v2) consistently distill better than code-specific teachers such as microsoft/codebert-base
## Methodology
### Evaluation Protocol
- **Dataset**: CodeSearchNet test sets for 6 programming languages
- **Metrics**: NDCG@k, MRR, Recall@k following CodeSearchNet methodology
- **Query Format**: Natural language documentation strings
- **Corpus Format**: Function code strings
- **Evaluation**: Retrieval of correct code for each documentation query
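Under this protocol each documentation query has exactly one relevant function, so the headline metrics reduce to simple functions of the correct snippet's 1-based retrieval rank. A minimal sketch (function name and example ranks are illustrative):

```python
import math

def retrieval_metrics(ranks: list[int], k: int = 10) -> dict[str, float]:
    """Compute MRR, Recall@5, and NDCG@k for a single-relevant-item setup.

    ranks: 1-based rank of the correct code snippet for each query.
    With one relevant item per query, ideal DCG is 1, so NDCG
    reduces to 1 / log2(1 + rank) when the item appears in the top k.
    """
    n = len(ranks)
    mrr = sum(1.0 / r for r in ranks) / n
    recall_at_5 = sum(r <= 5 for r in ranks) / n
    ndcg = sum(1.0 / math.log2(1 + r) if r <= k else 0.0 for r in ranks) / n
    return {"mrr": mrr, "recall@5": recall_at_5, f"ndcg@{k}": ndcg}

print(retrieval_metrics([1, 2, 7, 12]))  # one perfect hit, two partial hits, one miss
```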
### Teacher Models Tested
- [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) (proven baseline)
- [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) (general purpose)
- [sentence-transformers/paraphrase-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2) (paraphrase model)
- [microsoft/codebert-base](https://huggingface.co/microsoft/codebert-base) (code-specialized)
- [microsoft/graphcodebert-base](https://huggingface.co/microsoft/graphcodebert-base) (graph-aware code model)
- [Alibaba-NLP/gte-Qwen2-1.5B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct) (instruction model)
- [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) (multilingual model)
- [jinaai/jina-embeddings-v3](https://huggingface.co/jinaai/jina-embeddings-v3) (modern embedding model)
- [nomic-ai/nomic-embed-text-v2-moe](https://huggingface.co/nomic-ai/nomic-embed-text-v2-moe) (mixture of experts)
- [Qodo/Qodo-Embed-1-1.5B](https://huggingface.co/Qodo/Qodo-Embed-1-1.5B) (code-specialized)
- [lightonai/Reason-ModernColBERT](https://huggingface.co/lightonai/Reason-ModernColBERT) (ColBERT architecture)
- [Linq-AI-Research/Linq-Embed-Mistral](https://huggingface.co/Linq-AI-Research/Linq-Embed-Mistral) (Mistral-based)
- [BAAI/bge-code-v1](https://huggingface.co/BAAI/bge-code-v1) (code-specialized BGE)
- [Salesforce/SFR-Embedding-Code-2B_R](https://huggingface.co/Salesforce/SFR-Embedding-Code-2B_R) (large code model)
### Distillation Method
- **Technique**: Model2Vec static embedding generation
- **Parameters**: PCA dims=256, SIF coefficient=1e-3, Zipf weighting=True
- **Training Data**: CodeSearchNet comment-code pairs
- **Languages**: Python, JavaScript, Java, PHP, Ruby, Go
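The post-processing recipe behind these parameters (reduce per-token teacher embeddings to 256 dims with PCA, then down-weight frequent tokens) can be sketched in pure NumPy. This is an illustrative reconstruction, not the pipeline's actual code: `teacher_emb` is synthetic stand-in data, and token probabilities are approximated by Zipf's law over frequency rank with SIF weights w = a / (a + p):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, teacher_dim, pca_dims, sif_a = 1000, 768, 256, 1e-3

# Stand-in for the embeddings obtained by running each vocabulary
# token through the teacher model (e.g. 768-dim all-mpnet-base-v2).
teacher_emb = rng.normal(size=(vocab_size, teacher_dim))

# 1) PCA down to 256 dims via SVD of the centered matrix.
centered = teacher_emb - teacher_emb.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ vt[:pca_dims].T              # shape (vocab_size, 256)

# 2) Zipf-based SIF weighting: approximate token probability by
# Zipf's law over frequency rank, then scale rows by a / (a + p).
ranks = np.arange(1, vocab_size + 1)              # row i = i-th most frequent token
zipf_p = (1 / ranks) / (1 / ranks).sum()
static_emb = reduced * (sif_a / (sif_a + zipf_p))[:, None]

print(static_emb.shape)  # (1000, 256)
```

In practice the `model2vec` library exposes this pipeline directly from a Sentence Transformers checkpoint (at the time of writing, via its `distill` entry point with a `pca_dims` parameter), operating on real teacher embeddings rather than random data.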
---
*Report generated on 2025-06-01 08:04:06 using automated analysis pipeline.*
*For questions about methodology or results, please refer to the CodeSearchNet documentation.*