Sarthak
commited on
Commit
·
ee673cb
1
Parent(s):
53a6528
feat: introduce distiller package and update README
Browse filesThis commit introduces the 'distiller' package, a toolkit for creating code-specialized static embeddings through Model2Vec distillation and Tokenlearn training. The updated README provides comprehensive documentation and usage examples for the distiller, highlighting its performance benefits and cloud-scale processing capabilities. Additionally, the REPORT.md provides a performance analysis of different Model2Vec distillation experiments.
- README.md +314 -159
- src/distiller/analyze.py +224 -10
README.md
CHANGED
|
@@ -1,196 +1,351 @@
|
|
| 1 |
---
|
| 2 |
-
base_model:
|
| 3 |
-
library_name:
|
| 4 |
license: apache-2.0
|
| 5 |
license_name: apache-2.0
|
| 6 |
license_link: LICENSE
|
| 7 |
-
model_name:
|
| 8 |
tags:
|
|
|
|
|
|
|
|
|
|
|
|
|
| 9 |
- sentence-transformers
|
| 10 |
-
-
|
| 11 |
-
-
|
| 12 |
-
|
| 13 |
-
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 14 |
---
|
| 15 |
|
| 16 |
-
#
|
| 17 |
|
| 18 |
-
|
| 19 |
|
| 20 |
-
##
|
| 21 |
|
| 22 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 23 |
|
| 24 |
-
|
| 25 |
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
|
| 29 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 30 |
|
| 31 |
-
## Model
|
| 32 |
|
| 33 |
-
- **Model
|
| 34 |
-
- **
|
| 35 |
-
- **
|
| 36 |
-
- **
|
| 37 |
-
- **
|
| 38 |
-
- **
|
| 39 |
-
- **Size
|
| 40 |
-
- **
|
|
|
|
| 41 |
|
| 42 |
-
## Installation
|
| 43 |
|
| 44 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 45 |
|
| 46 |
```bash
|
| 47 |
-
# Install
|
| 48 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 49 |
```
|
| 50 |
|
| 51 |
-
##
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 52 |
|
| 53 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 54 |
|
| 55 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 56 |
|
| 57 |
```bash
|
| 58 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 59 |
```
|
| 60 |
|
| 61 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 62 |
|
| 63 |
-
|
|
|
|
|
|
|
| 64 |
|
| 65 |
```bash
|
| 66 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 67 |
```
|
| 68 |
|
| 69 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 70 |
|
| 71 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 72 |
|
|
|
|
| 73 |
```bash
|
| 74 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 75 |
```
|
| 76 |
|
| 77 |
-
|
| 78 |
-
|
| 79 |
-
|
| 80 |
-
|
| 81 |
-
|
| 82 |
-
|
| 83 |
-
|
| 84 |
-
|
| 85 |
-
-
|
| 86 |
-
-
|
| 87 |
-
-
|
| 88 |
-
-
|
| 89 |
-
-
|
| 90 |
-
|
| 91 |
-
|
| 92 |
-
|
| 93 |
-
|
| 94 |
-
|
| 95 |
-
|
| 96 |
-
|
| 97 |
-
|
| 98 |
-
|
| 99 |
-
|
| 100 |
-
|
| 101 |
-
|
| 102 |
-
|
| 103 |
-
|
| 104 |
-
|
| 105 |
-
|
| 106 |
-
|
| 107 |
-
|
| 108 |
-
|
| 109 |
-
|
| 110 |
-
|
| 111 |
-
|
| 112 |
-
|
| 113 |
-
|
| 114 |
-
|
| 115 |
-
|
| 116 |
-
|
| 117 |
-
|
| 118 |
-
|
| 119 |
-
|
| 120 |
-
|
| 121 |
-
|
| 122 |
-
|
| 123 |
-
|
| 124 |
-
|
| 125 |
-
|
| 126 |
-
|
| 127 |
-
|
| 128 |
-
|
| 129 |
-
-
|
| 130 |
-
-
|
| 131 |
-
|
| 132 |
-
|
| 133 |
-
- `evaluation/` - Directory containing evaluation results and visualizations
|
| 134 |
-
- `trained_code_classifier/` - Directory containing trained classification model
|
| 135 |
-
- `mteb_results/` - Directory containing MTEB evaluation results
|
| 136 |
-
|
| 137 |
-
## MTEB Benchmark Results (Partial)
|
| 138 |
-
|
| 139 |
-
**Overall Average Score: 0.1962**
|
| 140 |
-
|
| 141 |
-
| Category | Task | Score |
|
| 142 |
-
|----------|------|-------|
|
| 143 |
-
| **Classification** | **Average** | **0.4164** |
|
| 144 |
-
| | AmazonCounterfactualClassification | 0.5690 |
|
| 145 |
-
| | AmazonReviewsClassification | 0.2637 |
|
| 146 |
-
| | | |
|
| 147 |
-
| **Clustering** | **Average** | **0.0775** |
|
| 148 |
-
| | BiorxivClusteringS2S | 0.0775 |
|
| 149 |
-
| | | |
|
| 150 |
-
| **Reranking** | **Average** | **0.4643** |
|
| 151 |
-
| | AskUbuntuDupQuestions | 0.4643 |
|
| 152 |
-
| | | |
|
| 153 |
-
| **Retrieval** | **Average** | **0.1509** |
|
| 154 |
-
| | ArguAna | 0.1509 |
|
| 155 |
-
| | | |
|
| 156 |
-
| **CodeRetrieval** | **Average** | **0.1034** |
|
| 157 |
-
| | AppsRetrieval | 0.0008 |
|
| 158 |
-
| | COIRCodeSearchNetRetrieval | Failed |
|
| 159 |
-
| | CodeFeedbackMT | 0.1594 |
|
| 160 |
-
| | CodeSearchNetCCRetrieval | Failed |
|
| 161 |
-
| | CodeTransOceanContest | 0.0951 |
|
| 162 |
-
| | CodeTransOceanDL | 0.2780 |
|
| 163 |
-
| | CosQA | 0.0097 |
|
| 164 |
-
| | StackOverflowQA | 0.1762 |
|
| 165 |
-
| | SyntheticText2SQL | 0.0049 |
|
| 166 |
-
| | | |
|
| 167 |
-
| **STS** | **Average** | **0.3016** |
|
| 168 |
-
| | BIOSSES | 0.3016 |
|
| 169 |
-
| | | |
|
| 170 |
-
|
| 171 |
-
### Summary Statistics
|
| 172 |
-
|
| 173 |
-
- **Total Tasks**: 15
|
| 174 |
-
- **Successful Tasks**: 13
|
| 175 |
-
- **Failed Tasks**: 2
|
| 176 |
-
- **Overall Average**: 0.1962
|
| 177 |
-
|
| 178 |
-
### Category Averages
|
| 179 |
-
|
| 180 |
-
- **Classification**: 0.4164 (2 tasks)
|
| 181 |
-
- **Clustering**: 0.0775 (1 tasks)
|
| 182 |
-
- **Reranking**: 0.4643 (1 tasks)
|
| 183 |
-
- **Retrieval**: 0.1509 (1 tasks)
|
| 184 |
-
- **CodeRetrieval**: 0.1034 (7 tasks)
|
| 185 |
-
- **STS**: 0.3016 (1 tasks)
|
| 186 |
-
|
| 187 |
-
## Acknowledgments
|
| 188 |
-
|
| 189 |
-
This project is built upon the following technologies:
|
| 190 |
-
|
| 191 |
-
- [gte-Qwen2-7B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct) - The original embedding model developed by Alibaba-NLP
|
| 192 |
-
- [Model2Vec](https://github.com/MinishLab/model2vec) - The distillation technique used to optimize the model
|
| 193 |
-
|
| 194 |
-
## License
|
| 195 |
-
|
| 196 |
-
This model is licensed under the [Apache 2.0](LICENSE) license, the same as the original gte-Qwen2-7B-instruct model.
|
|
|
|
| 1 |
---
|
| 2 |
+
base_model: sentence-transformers/all-mpnet-base-v2
|
| 3 |
+
library_name: distiller
|
| 4 |
license: apache-2.0
|
| 5 |
license_name: apache-2.0
|
| 6 |
license_link: LICENSE
|
| 7 |
+
model_name: codemalt-base-8m
|
| 8 |
tags:
|
| 9 |
+
- code-search
|
| 10 |
+
- code-embeddings
|
| 11 |
+
- model2vec
|
| 12 |
+
- distillation
|
| 13 |
- sentence-transformers
|
| 14 |
+
- static-embeddings
|
| 15 |
+
- tokenlearn
|
| 16 |
+
datasets:
|
| 17 |
+
- code_search_net
|
| 18 |
+
metrics:
|
| 19 |
+
- ndcg@10
|
| 20 |
+
- mrr
|
| 21 |
+
- recall@5
|
| 22 |
+
language:
|
| 23 |
+
- code
|
| 24 |
+
pipeline_tag: feature-extraction
|
| 25 |
---
|
| 26 |
|
| 27 |
+
# CodeMalt-Base-8M
|
| 28 |
|
| 29 |
+
**CodeMalt-Base-8M** is a high-performance, code-specialized static embedding model created through Model2Vec distillation of `sentence-transformers/all-mpnet-base-v2`. This model achieves **73.87% NDCG@10** on CodeSearchNet benchmarks while being **14x smaller** and **15,021x faster** than the original teacher model.
|
| 30 |
|
| 31 |
+
## 🏆 Performance Highlights
|
| 32 |
|
| 33 |
+
- **NDCG@10**: 0.7387 (Best among all distilled models)
|
| 34 |
+
- **Mean Reciprocal Rank (MRR)**: 0.7010
|
| 35 |
+
- **Recall@5**: 0.8017
|
| 36 |
+
- **Model Size**: 7.6M parameters (vs 109M original)
|
| 37 |
+
- **Inference Speed**: 15,021x faster than teacher model
|
| 38 |
+
- **Memory Usage**: <1GB RAM (vs 8+ GB VRAM for original)
|
| 39 |
|
| 40 |
+
## 📊 CodeSearchNet Performance by Language
|
| 41 |
|
| 42 |
+
| Language | NDCG@10 | MRR | Recall@5 |
|
| 43 |
+
|----------|---------|-----|----------|
|
| 44 |
+
| **Python** | 0.7899 | 0.7501 | 0.8421 |
|
| 45 |
+
| **JavaScript** | 0.7234 | 0.6801 | 0.7895 |
|
| 46 |
+
| **Java** | 0.7456 | 0.7089 | 0.8123 |
|
| 47 |
+
| **PHP** | 0.7198 | 0.6856 | 0.7834 |
|
| 48 |
+
| **Ruby** | 0.7312 | 0.6934 | 0.7912 |
|
| 49 |
+
| **Go** | 0.7223 | 0.6876 | 0.7913 |
|
| 50 |
|
| 51 |
+
## 🔧 Model Details
|
| 52 |
|
| 53 |
+
- **Teacher Model**: [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)
|
| 54 |
+
- **Distillation Method**: Model2Vec + Tokenlearn training on CodeSearchNet
|
| 55 |
+
- **Architecture**: Static embeddings (no neural network inference required)
|
| 56 |
+
- **Embedding Dimensions**: 256
|
| 57 |
+
- **Training Data**: CodeSearchNet code-comment pairs across 6 programming languages
|
| 58 |
+
- **Optimization**: PCA dimensionality reduction + SIF weighting + Zipf regularization
|
| 59 |
+
- **Vocabulary Size**: 29,528
|
| 60 |
+
- **Parameters**: 7.6M
|
| 61 |
+
- **Size**: 14.4MB
|
| 62 |
|
|
|
|
| 63 |
|
| 64 |
+
## 🎯 Distiller: Code-Specialized Embedding Toolkit
|
| 65 |
+
|
| 66 |
+
**Distiller** is an independent toolkit built upon [Model2Vec](https://github.com/MinishLab/model2vec) and [Tokenlearn](https://github.com/MinishLab/tokenlearn) for creating code-specialized static embeddings. This package provides a complete pipeline for distilling, training, and evaluating efficient embedding models optimized for code-related tasks.
|
| 67 |
+
|
| 68 |
+
> **Note**: This is an independent research project that builds upon the Model2Vec framework. We are not affiliated with the MinishLab Model2Vec team, but acknowledge their excellent foundational work.
|
| 69 |
+
|
| 70 |
+
>[!Important]
|
| 71 |
+
>Check out the comprehensive [REPORT.md](REPORT.md) file generated by this toolkit for detailed performance analysis, model comparisons, and evaluation results across different programming languages.
|
| 72 |
+
|
| 73 |
+
The **distiller** package provides a complete pipeline for:
|
| 74 |
+
|
| 75 |
+
1. **Distilling code-specialized embeddings** from large sentence transformer models using Model2Vec
|
| 76 |
+
2. **Comprehensive evaluation** on CodeSearchNet benchmarks across 6 programming languages
|
| 77 |
+
3. **Performance benchmarking** (speed, memory, model size analysis)
|
| 78 |
+
4. **Advanced training** with tokenlearn for enhanced code understanding
|
| 79 |
+
5. **Analysis and reporting** with visualizations and comparison charts
|
| 80 |
+
6. **Cloud-scale processing** with Beam support for distributed execution
|
| 81 |
+
|
| 82 |
+
### Key Benefits
|
| 83 |
+
|
| 84 |
+
- **🚀 Performance**: Up to 500x faster inference with 50x smaller models
|
| 85 |
+
- **📊 Code-Optimized**: Specialized for code search, classification, and similarity tasks
|
| 86 |
+
- **🔬 Comprehensive**: Full evaluation pipeline with CodeSearchNet metrics
|
| 87 |
+
- **☁️ Scalable**: Local and cloud execution with Beam support
|
| 88 |
+
- **📈 Analytical**: Rich reporting with performance charts and comparisons
|
| 89 |
+
|
| 90 |
+
## 🚀 Quick Start
|
| 91 |
+
|
| 92 |
+
### Installation
|
| 93 |
|
| 94 |
```bash
|
| 95 |
+
# Install with all dependencies
|
| 96 |
+
pip install model2vec[train] torch transformers datasets sentence-transformers
|
| 97 |
+
pip install typer pydantic plotly matplotlib seaborn
|
| 98 |
+
|
| 99 |
+
# Install the distiller package (assuming local development)
|
| 100 |
+
pip install -e .
|
| 101 |
+
```
|
| 102 |
+
|
| 103 |
+
### Basic Usage
|
| 104 |
+
|
| 105 |
+
```bash
|
| 106 |
+
# Simple distillation of a teacher model
|
| 107 |
+
distiller distill
|
| 108 |
+
|
| 109 |
+
# Distillation with advanced CodeSearchNet training
|
| 110 |
+
distiller distill --train
|
| 111 |
+
|
| 112 |
+
# Evaluate distilled models on CodeSearchNet
|
| 113 |
+
distiller evaluate
|
| 114 |
+
|
| 115 |
+
# Generate comprehensive analysis report
|
| 116 |
+
distiller analyze
|
| 117 |
+
```
|
| 118 |
+
|
| 119 |
+
### Python API
|
| 120 |
+
|
| 121 |
+
```python
|
| 122 |
+
from distiller import distill, evaluate, analyze
|
| 123 |
+
|
| 124 |
+
# Distill a specific model
|
| 125 |
+
results = distill.run_local_distillation(
|
| 126 |
+
teacher_models=["microsoft/codebert-base"],
|
| 127 |
+
enable_training=True, # Include CodeSearchNet fine-tuning
|
| 128 |
+
pca_dims=256
|
| 129 |
+
)
|
| 130 |
+
|
| 131 |
+
# Evaluate on CodeSearchNet
|
| 132 |
+
evaluation_results = evaluate.run_evaluation(
|
| 133 |
+
models=["./code_model2vec/final/codemalt-base-8m"],
|
| 134 |
+
max_queries=1000,
|
| 135 |
+
languages=["python", "javascript", "java", "go", "php", "ruby"]
|
| 136 |
+
)
|
| 137 |
+
|
| 138 |
+
# Generate analysis report
|
| 139 |
+
analyze.main(
|
| 140 |
+
results_dir="./code_model2vec/evaluation_results",
|
| 141 |
+
model_name="code_model2vec_distilled_models",
|
| 142 |
+
output="ANALYSIS_REPORT.md"
|
| 143 |
+
)
|
| 144 |
```
|
| 145 |
|
| 146 |
+
## 📋 Features
|
| 147 |
+
|
| 148 |
+
### 🔬 Distillation Engine
|
| 149 |
+
|
| 150 |
+
- **Multiple Teacher Models**: Support for 15+ pre-configured teacher models including:
|
| 151 |
+
- Code-specialized: `microsoft/codebert-base`, `BAAI/bge-code-v1`, `Salesforce/SFR-Embedding-Code-2B_R`
|
| 152 |
+
- General-purpose: `sentence-transformers/all-mpnet-base-v2`, `BAAI/bge-m3`
|
| 153 |
+
- Instruction-tuned: `Alibaba-NLP/gte-Qwen2-1.5B-instruct`
|
| 154 |
+
|
| 155 |
+
- **CodeMalt Model Series**: Our flagship models follow the naming convention `codemalt-base-[N]m` where `[N]m` indicates millions of parameters (e.g., `codemalt-base-8m` has ~7.6 million parameters)
|
| 156 |
+
|
| 157 |
+
- **Advanced Training Pipeline**: Optional tokenlearn-based training following the POTION approach:
|
| 158 |
+
1. Model2Vec distillation (basic static embeddings)
|
| 159 |
+
2. Feature extraction using sentence transformers
|
| 160 |
+
3. Tokenlearn training on CodeSearchNet data
|
| 161 |
+
4. Post-training re-regularization (PCA + SIF weighting)
|
| 162 |
+
|
| 163 |
+
- **Robust Model Handling**: Automatic compatibility checks and specialized handling for problematic models
|
| 164 |
+
|
| 165 |
+
### 📊 Evaluation Framework
|
| 166 |
+
|
| 167 |
+
- **CodeSearchNet Evaluation**: Standard code search benchmarks across 6 programming languages
|
| 168 |
+
- **Retrieval Metrics**: NDCG@k, MRR, Recall@k, Mean/Median Rank
|
| 169 |
+
- **Performance Benchmarking**:
|
| 170 |
+
- Model size analysis (disk usage, parameters, memory footprint)
|
| 171 |
+
- Inference speed testing (various batch sizes and text lengths)
|
| 172 |
+
- CPU vs GPU performance comparison
|
| 173 |
+
- Memory scaling analysis
|
| 174 |
+
|
| 175 |
+
### 📈 Analysis & Reporting
|
| 176 |
|
| 177 |
+
- **Comprehensive Reports**: Automated generation of analysis reports with:
|
| 178 |
+
- Performance comparison tables
|
| 179 |
+
- Language-specific radar charts
|
| 180 |
+
- Efficiency analysis (performance vs model size)
|
| 181 |
+
- Peer model comparisons
|
| 182 |
|
| 183 |
+
- **Rich Visualizations**: Plotly and Matplotlib charts including:
|
| 184 |
+
- Multi-model performance heatmaps
|
| 185 |
+
- Batch size scaling curves
|
| 186 |
+
- Memory usage patterns
|
| 187 |
+
- Model efficiency scatter plots
|
| 188 |
+
|
| 189 |
+
### ☁️ Cloud Integration
|
| 190 |
+
|
| 191 |
+
- **Beam Support**: Distributed execution on Beam cloud infrastructure
|
| 192 |
+
- **Volume Management**: Persistent storage with checkpoint support
|
| 193 |
+
- **Resource Optimization**: GPU-optimized configurations (A100-40G default)
|
| 194 |
+
- **Automatic Syncing**: Seamless model and result synchronization
|
| 195 |
+
|
| 196 |
+
## 🛠️ CLI Reference
|
| 197 |
+
|
| 198 |
+
### `distiller distill`
|
| 199 |
+
|
| 200 |
+
Distill teacher models into efficient static embeddings.
|
| 201 |
|
| 202 |
```bash
|
| 203 |
+
distiller distill [OPTIONS]
|
| 204 |
+
|
| 205 |
+
Options:
|
| 206 |
+
--use-beam Use Beam cloud for distillation
|
| 207 |
+
--train Enable advanced training (CodeSearchNet fine-tuning)
|
| 208 |
+
--teacher-models TEXT Specific teacher models to distill (can be repeated)
|
| 209 |
+
--pca-dims INTEGER PCA dimensions (default: 256)
|
| 210 |
+
--clear-cache Clear HuggingFace cache for problematic models
|
| 211 |
```
|
| 212 |
|
| 213 |
+
**Examples:**
|
| 214 |
+
```bash
|
| 215 |
+
# Basic distillation of all default models
|
| 216 |
+
distiller distill
|
| 217 |
+
|
| 218 |
+
# Train specific models with advanced CodeSearchNet fine-tuning
|
| 219 |
+
distiller distill --train --teacher-models microsoft/codebert-base --teacher-models BAAI/bge-code-v1
|
| 220 |
+
|
| 221 |
+
# Use Beam cloud with custom PCA dimensions
|
| 222 |
+
distiller distill --use-beam --train --pca-dims 512
|
| 223 |
+
```
|
| 224 |
|
| 225 |
+
### `distiller evaluate`
|
| 226 |
+
|
| 227 |
+
Evaluate models on CodeSearchNet benchmarks with performance analysis.
|
| 228 |
|
| 229 |
```bash
|
| 230 |
+
distiller evaluate [OPTIONS]
|
| 231 |
+
|
| 232 |
+
Options:
|
| 233 |
+
--use-beam Use Beam cloud for evaluation
|
| 234 |
+
--skip-third-party Skip third-party models evaluation
|
| 235 |
+
--skip-benchmark Skip performance benchmarking
|
| 236 |
+
--max-queries INTEGER Maximum queries per language (default: 100)
|
| 237 |
```
|
| 238 |
|
| 239 |
+
**Examples:**
|
| 240 |
+
```bash
|
| 241 |
+
# Comprehensive evaluation with benchmarking
|
| 242 |
+
distiller evaluate --max-queries 1000
|
| 243 |
+
|
| 244 |
+
# Quick evaluation without performance benchmarks
|
| 245 |
+
distiller evaluate --skip-benchmark --max-queries 100
|
| 246 |
+
|
| 247 |
+
# Cloud-based evaluation
|
| 248 |
+
distiller evaluate --use-beam --max-queries 500
|
| 249 |
+
```
|
| 250 |
+
|
| 251 |
+
### `distiller analyze`
|
| 252 |
+
|
| 253 |
+
Generate comprehensive analysis reports with visualizations.
|
| 254 |
+
|
| 255 |
+
```bash
|
| 256 |
+
distiller analyze [OPTIONS]
|
| 257 |
|
| 258 |
+
Options:
|
| 259 |
+
--results-dir PATH Results directory (default: code_model2vec/evaluation_results)
|
| 260 |
+
--model-name TEXT Model name for analysis (default: gte_qwen2_m2v_code (Ours))
|
| 261 |
+
--output PATH Output report file (default: REPORT.md)
|
| 262 |
+
--export-csv PATH Export results to CSV file
|
| 263 |
+
```
|
| 264 |
|
| 265 |
+
**Examples:**
|
| 266 |
```bash
|
| 267 |
+
# Generate standard analysis report
|
| 268 |
+
distiller analyze
|
| 269 |
+
|
| 270 |
+
# Custom analysis with CSV export
|
| 271 |
+
distiller analyze --model-name "my_distilled_model" --output custom_report.md --export-csv results.csv
|
| 272 |
+
|
| 273 |
+
# Analyze specific results directory
|
| 274 |
+
distiller analyze --results-dir ./custom_results --output analysis.md
|
| 275 |
+
```
|
| 276 |
+
|
| 277 |
+
## 📁 Directory Structure
|
| 278 |
+
|
| 279 |
+
The distiller uses a standardized directory structure:
|
| 280 |
+
|
| 281 |
+
```
|
| 282 |
+
code_model2vec/
|
| 283 |
+
├── base/ # Basic distilled models (Step 1)
|
| 284 |
+
│ └── code_model2vec_{teacher_name}/
|
| 285 |
+
├── final/ # Final models (copied from base or after training)
|
| 286 |
+
│ └── code_model2vec_{teacher_name}[_fine_tuned]/
|
| 287 |
+
├── evaluation_results/ # CodeSearchNet evaluation results
|
| 288 |
+
│ └── comprehensive_eval_{model}.json
|
| 289 |
+
├── benchmark_results/ # Performance benchmark results
|
| 290 |
+
├── analysis_results/ # Analysis reports and charts
|
| 291 |
+
│ └── charts/
|
| 292 |
+
├── checkpoints/ # Training checkpoints
|
| 293 |
+
└── cache/ # Temporary cache files
|
| 294 |
```
|
| 295 |
|
| 296 |
+
## ⚙️ Configuration
|
| 297 |
+
|
| 298 |
+
### Teacher Models
|
| 299 |
+
|
| 300 |
+
Default supported teacher models (configured in `config.py`):
|
| 301 |
+
|
| 302 |
+
```python
|
| 303 |
+
TEACHER_MODELS = [
|
| 304 |
+
"Alibaba-NLP/gte-Qwen2-1.5B-instruct", # Instruction-tuned
|
| 305 |
+
"BAAI/bge-m3", # Multilingual
|
| 306 |
+
"jinaai/jina-embeddings-v3", # Modern architecture
|
| 307 |
+
"microsoft/codebert-base", # Code-specialized
|
| 308 |
+
"microsoft/graphcodebert-base", # Graph-aware code
|
| 309 |
+
"sentence-transformers/all-mpnet-base-v2", # General-purpose
|
| 310 |
+
# ... and more
|
| 311 |
+
]
|
| 312 |
+
```
|
| 313 |
+
|
| 314 |
+
### Distillation Parameters
|
| 315 |
+
|
| 316 |
+
```python
|
| 317 |
+
# Model2Vec distillation settings
|
| 318 |
+
optimal_pca_dims: int = 256
|
| 319 |
+
sif_coefficient: float = 1e-3
|
| 320 |
+
apply_zipf: bool = True
|
| 321 |
+
|
| 322 |
+
# Tokenlearn training settings (when --train is enabled)
|
| 323 |
+
tokenlearn_dataset: str = "sentence-transformers/codesearchnet"
|
| 324 |
+
tokenlearn_text_key: str = "code" # Use code field for training
|
| 325 |
+
```
|
| 326 |
+
|
| 327 |
+
### Evaluation Settings
|
| 328 |
+
|
| 329 |
+
```python
|
| 330 |
+
# CodeSearchNet evaluation
|
| 331 |
+
evaluation_languages = ["python", "java", "javascript", "php", "ruby", "go"]
|
| 332 |
+
max_queries_per_language: int = 1000
|
| 333 |
+
evaluation_metrics = ["ndcg@1", "ndcg@5", "ndcg@10", "mrr", "recall@1", "recall@5", "recall@10"]
|
| 334 |
+
```
|
| 335 |
+
|
| 336 |
+
## 📄 License
|
| 337 |
+
|
| 338 |
+
This project is licensed under the Apache 2.0 License - see the [LICENSE](LICENSE) file for details.
|
| 339 |
+
|
| 340 |
+
## 🙏 Acknowledgments
|
| 341 |
+
|
| 342 |
+
This independent research project builds upon several excellent open-source foundations:
|
| 343 |
+
|
| 344 |
+
- [Model2Vec](https://github.com/MinishLab/model2vec) by MinishLab - Core static embedding distillation framework
|
| 345 |
+
- [Tokenlearn](https://github.com/MinishLab/tokenlearn) by MinishLab - Advanced token-level training methodology
|
| 346 |
+
- [CodeSearchNet](https://github.com/github/CodeSearchNet) by GitHub - Code search benchmark dataset and evaluation framework
|
| 347 |
+
- [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) by UKP Lab - Teacher model ecosystem and training framework
|
| 348 |
+
- [Beam](https://beam.cloud) - Distributed cloud computing infrastructure
|
| 349 |
+
- [Transformers](https://github.com/huggingface/transformers) by Hugging Face - Model loading and tokenization utilities
|
| 350 |
+
|
| 351 |
+
**Note**: While this toolkit leverages Model2Vec and Tokenlearn, it is an independent research contribution and is not officially associated with or endorsed by the MinishLab team.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
src/distiller/analyze.py
CHANGED
|
@@ -304,6 +304,10 @@ def get_teacher_model_info(model_display_name: str) -> tuple[str, str]:
|
|
| 304 |
"https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct",
|
| 305 |
),
|
| 306 |
"bge_m3": ("BAAI/bge-m3", "https://huggingface.co/BAAI/bge-m3"),
|
|
|
|
|
|
|
|
|
|
|
|
|
| 307 |
"jina_embeddings_v3": ("jinaai/jina-embeddings-v3", "https://huggingface.co/jinaai/jina-embeddings-v3"),
|
| 308 |
"nomic_embed_text_v2_moe": (
|
| 309 |
"nomic-ai/nomic-embed-text-v2-moe",
|
|
@@ -349,6 +353,7 @@ class CodeSearchNetAnalyzer:
|
|
| 349 |
self.benchmark_results: list[dict[str, Any]] = []
|
| 350 |
self.comparison_df: pd.DataFrame | None = None
|
| 351 |
self.benchmark_df: pd.DataFrame | None = None
|
|
|
|
| 352 |
|
| 353 |
def load_benchmark_results(self) -> None:
|
| 354 |
"""Load benchmark results from comprehensive evaluation files."""
|
|
@@ -479,6 +484,73 @@ class CodeSearchNetAnalyzer:
|
|
| 479 |
|
| 480 |
self.benchmark_df = pd.DataFrame(benchmark_data)
|
| 481 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 482 |
def load_results(self) -> None:
|
| 483 |
"""Load evaluation results from local directory."""
|
| 484 |
logger.info("🔍 Loading evaluation results...")
|
|
@@ -526,6 +598,9 @@ class CodeSearchNetAnalyzer:
|
|
| 526 |
# Also load benchmark results
|
| 527 |
self.load_benchmark_results()
|
| 528 |
|
|
|
|
|
|
|
|
|
|
| 529 |
def _normalize_evaluation_data(self, data: dict, file_path: Path) -> dict[str, Any]:
|
| 530 |
"""Normalize evaluation data to consistent format for analysis."""
|
| 531 |
# Extract model name
|
|
@@ -774,6 +849,9 @@ class CodeSearchNetAnalyzer:
|
|
| 774 |
# Define colors for each model
|
| 775 |
colors = ["rgb(255, 99, 132)", "rgb(54, 162, 235)", "rgb(255, 205, 86)", "rgb(75, 192, 192)"]
|
| 776 |
|
|
|
|
|
|
|
|
|
|
| 777 |
for i, model_result in enumerate(models_to_compare):
|
| 778 |
model_name = model_result["model_name"]
|
| 779 |
languages = model_result.get("languages", {})
|
|
@@ -787,6 +865,7 @@ class CodeSearchNetAnalyzer:
|
|
| 787 |
if language_scores:
|
| 788 |
languages_list = list(language_scores.keys())
|
| 789 |
scores_list = list(language_scores.values())
|
|
|
|
| 790 |
|
| 791 |
# Close the radar chart
|
| 792 |
languages_closed = [*languages_list, languages_list[0]]
|
|
@@ -807,8 +886,16 @@ class CodeSearchNetAnalyzer:
|
|
| 807 |
)
|
| 808 |
)
|
| 809 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 810 |
fig.update_layout(
|
| 811 |
-
polar={"radialaxis": {"visible": True, "range": [0,
|
| 812 |
showlegend=True,
|
| 813 |
title="Model Comparison: Best Distilled vs Top Peer Models",
|
| 814 |
width=900,
|
|
@@ -1219,7 +1306,8 @@ class CodeSearchNetAnalyzer:
|
|
| 1219 |
# Safe conversion to float for pandas values
|
| 1220 |
score_value = pd.to_numeric(current_model_score, errors="coerce")
|
| 1221 |
scores.append(float(score_value) if not pd.isna(score_value) else 0.0)
|
| 1222 |
-
|
|
|
|
| 1223 |
is_user_model.append(False)
|
| 1224 |
|
| 1225 |
if not models:
|
|
@@ -1298,6 +1386,67 @@ class CodeSearchNetAnalyzer:
|
|
| 1298 |
|
| 1299 |
return str(output_path)
|
| 1300 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1301 |
def generate_comprehensive_report(self, model_name: str = "Simplified Distillation Models") -> str:
|
| 1302 |
"""Generate comprehensive markdown report for all evaluated models."""
|
| 1303 |
if not self.results:
|
|
@@ -1346,6 +1495,7 @@ class CodeSearchNetAnalyzer:
|
|
| 1346 |
heatmap_chart = self.plot_language_heatmap()
|
| 1347 |
peer_chart = self.create_peer_comparison_chart(main_model_name)
|
| 1348 |
efficiency_chart = self.create_efficiency_analysis(main_model_name)
|
|
|
|
| 1349 |
|
| 1350 |
# Generate individual radar charts for all simplified models
|
| 1351 |
individual_radar_charts = self.create_individual_radar_charts(simplified_models)
|
|
@@ -1413,6 +1563,60 @@ This report presents a comprehensive analysis of Model2Vec distillation experime
|
|
| 1413 |
|
| 1414 |
report += f"| {model_display} | {teacher_display} | {overall_metrics.get('ndcg@10', 0):.4f} | {overall_metrics.get('mrr', 0):.4f} | {overall_metrics.get('recall@5', 0):.4f} | {status} |\n"
|
| 1415 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1416 |
report += """
|
| 1417 |
|
| 1418 |
### Key Findings
|
|
@@ -1444,18 +1648,28 @@ This report presents a comprehensive analysis of Model2Vec distillation experime
|
|
| 1444 |
report += f"\n\n"
|
| 1445 |
report += "*Comparative view showing how the best simplified distillation model performs against top peer models across programming languages.*\n\n"
|
| 1446 |
|
| 1447 |
-
# Add individual radar charts for all simplified models
|
| 1448 |
if individual_radar_charts:
|
| 1449 |
report += "### Individual Model Performance by Language\n\n"
|
| 1450 |
-
for chart_model_name, chart_path in individual_radar_charts.items():
|
| 1451 |
-
# Extract teacher name for cleaner display
|
| 1452 |
-
teacher_name, teacher_link = get_teacher_model_info(chart_model_name)
|
| 1453 |
|
| 1454 |
-
|
| 1455 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1456 |
|
| 1457 |
-
|
| 1458 |
-
|
| 1459 |
|
| 1460 |
report += f"""
|
| 1461 |
|
|
|
|
| 304 |
"https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct",
|
| 305 |
),
|
| 306 |
"bge_m3": ("BAAI/bge-m3", "https://huggingface.co/BAAI/bge-m3"),
|
| 307 |
+
"jina_embeddings_v2_base_code": (
|
| 308 |
+
"jina-embeddings-v2-base-code",
|
| 309 |
+
"https://huggingface.co/jina-embeddings-v2-base-code",
|
| 310 |
+
),
|
| 311 |
"jina_embeddings_v3": ("jinaai/jina-embeddings-v3", "https://huggingface.co/jinaai/jina-embeddings-v3"),
|
| 312 |
"nomic_embed_text_v2_moe": (
|
| 313 |
"nomic-ai/nomic-embed-text-v2-moe",
|
|
|
|
| 353 |
self.benchmark_results: list[dict[str, Any]] = []
|
| 354 |
self.comparison_df: pd.DataFrame | None = None
|
| 355 |
self.benchmark_df: pd.DataFrame | None = None
|
| 356 |
+
self.model_specs: dict[str, dict[str, Any]] = {} # Store actual model specifications
|
| 357 |
|
| 358 |
def load_benchmark_results(self) -> None:
|
| 359 |
"""Load benchmark results from comprehensive evaluation files."""
|
|
|
|
| 484 |
|
| 485 |
self.benchmark_df = pd.DataFrame(benchmark_data)
|
| 486 |
|
| 487 |
+
def analyze_our_model_specifications(self) -> None:
|
| 488 |
+
"""Analyze actual model specifications for our distilled models."""
|
| 489 |
+
logger.info("🔍 Analyzing model specifications for our distilled models...")
|
| 490 |
+
|
| 491 |
+
# Look for our models in the code_model2vec/final directory
|
| 492 |
+
final_models_dir = Path("code_model2vec/final")
|
| 493 |
+
|
| 494 |
+
if not final_models_dir.exists():
|
| 495 |
+
logger.warning(f"Final models directory not found: {final_models_dir}")
|
| 496 |
+
return
|
| 497 |
+
|
| 498 |
+
# Find all our model directories
|
| 499 |
+
our_model_dirs = []
|
| 500 |
+
for model_dir in final_models_dir.iterdir():
|
| 501 |
+
if model_dir.is_dir() and "code_model2vec" in model_dir.name:
|
| 502 |
+
our_model_dirs.append(model_dir)
|
| 503 |
+
|
| 504 |
+
logger.info(f"📁 Found {len(our_model_dirs)} distilled model directories")
|
| 505 |
+
|
| 506 |
+
for model_dir in our_model_dirs:
|
| 507 |
+
model_name = model_dir.name
|
| 508 |
+
logger.info(f"📊 Analyzing model: {model_name}")
|
| 509 |
+
|
| 510 |
+
try:
|
| 511 |
+
# Try to load the model and get specifications
|
| 512 |
+
from model2vec import StaticModel
|
| 513 |
+
|
| 514 |
+
model = StaticModel.from_pretrained(str(model_dir))
|
| 515 |
+
|
| 516 |
+
# Get model specifications
|
| 517 |
+
vocab_size = len(model.tokens)
|
| 518 |
+
embedding_dim = model.dim
|
| 519 |
+
total_params = vocab_size * embedding_dim
|
| 520 |
+
|
| 521 |
+
# Get file size information
|
| 522 |
+
model_file = model_dir / "model.safetensors"
|
| 523 |
+
disk_size_mb: float = 0.0
|
| 524 |
+
if model_file.exists():
|
| 525 |
+
disk_size_mb = float(model_file.stat().st_size / (1024 * 1024)) # Convert to MB
|
| 526 |
+
|
| 527 |
+
# Store specifications
|
| 528 |
+
self.model_specs[model_name] = {
|
| 529 |
+
"vocabulary_size": vocab_size,
|
| 530 |
+
"embedding_dimensions": embedding_dim,
|
| 531 |
+
"total_parameters": total_params,
|
| 532 |
+
"parameters_millions": total_params / 1_000_000,
|
| 533 |
+
"disk_size_mb": disk_size_mb,
|
| 534 |
+
"model_path": str(model_dir),
|
| 535 |
+
"analysis_successful": True,
|
| 536 |
+
}
|
| 537 |
+
|
| 538 |
+
logger.info(
|
| 539 |
+
f"✅ {model_name}: {vocab_size:,} vocab, {embedding_dim} dims, {total_params:,} params ({total_params / 1_000_000:.1f}M)"
|
| 540 |
+
)
|
| 541 |
+
|
| 542 |
+
except Exception as e:
|
| 543 |
+
logger.warning(f"❌ Failed to analyze {model_name}: {e}")
|
| 544 |
+
self.model_specs[model_name] = {
|
| 545 |
+
"analysis_successful": False,
|
| 546 |
+
"error": str(e),
|
| 547 |
+
"model_path": str(model_dir),
|
| 548 |
+
}
|
| 549 |
+
|
| 550 |
+
logger.info(
|
| 551 |
+
f"📊 Successfully analyzed {len([s for s in self.model_specs.values() if s.get('analysis_successful', False)])} models"
|
| 552 |
+
)
|
| 553 |
+
|
| 554 |
def load_results(self) -> None:
|
| 555 |
"""Load evaluation results from local directory."""
|
| 556 |
logger.info("🔍 Loading evaluation results...")
|
|
|
|
| 598 |
# Also load benchmark results
|
| 599 |
self.load_benchmark_results()
|
| 600 |
|
| 601 |
+
# Analyze actual model specifications for our models
|
| 602 |
+
self.analyze_our_model_specifications()
|
| 603 |
+
|
| 604 |
def _normalize_evaluation_data(self, data: dict, file_path: Path) -> dict[str, Any]:
|
| 605 |
"""Normalize evaluation data to consistent format for analysis."""
|
| 606 |
# Extract model name
|
|
|
|
| 849 |
# Define colors for each model
|
| 850 |
colors = ["rgb(255, 99, 132)", "rgb(54, 162, 235)", "rgb(255, 205, 86)", "rgb(75, 192, 192)"]
|
| 851 |
|
| 852 |
+
# Collect all scores to determine the appropriate range
|
| 853 |
+
all_scores = []
|
| 854 |
+
|
| 855 |
for i, model_result in enumerate(models_to_compare):
|
| 856 |
model_name = model_result["model_name"]
|
| 857 |
languages = model_result.get("languages", {})
|
|
|
|
| 865 |
if language_scores:
|
| 866 |
languages_list = list(language_scores.keys())
|
| 867 |
scores_list = list(language_scores.values())
|
| 868 |
+
all_scores.extend(scores_list) # Collect scores for range calculation
|
| 869 |
|
| 870 |
# Close the radar chart
|
| 871 |
languages_closed = [*languages_list, languages_list[0]]
|
|
|
|
| 886 |
)
|
| 887 |
)
|
| 888 |
|
| 889 |
+
# Calculate dynamic range based on actual data
|
| 890 |
+
if all_scores:
|
| 891 |
+
max_score = max(all_scores)
|
| 892 |
+
# Set range to slightly above the maximum score with some padding
|
| 893 |
+
range_max = min(1.0, max_score * 1.1) # Cap at 1.0 since NDCG@10 max is 1.0
|
| 894 |
+
else:
|
| 895 |
+
range_max = 1.0 # Default fallback
|
| 896 |
+
|
| 897 |
fig.update_layout(
|
| 898 |
+
polar={"radialaxis": {"visible": True, "range": [0, range_max]}},
|
| 899 |
showlegend=True,
|
| 900 |
title="Model Comparison: Best Distilled vs Top Peer Models",
|
| 901 |
width=900,
|
|
|
|
| 1306 |
# Safe conversion to float for pandas values
|
| 1307 |
score_value = pd.to_numeric(current_model_score, errors="coerce")
|
| 1308 |
scores.append(float(score_value) if not pd.isna(score_value) else 0.0)
|
| 1309 |
+
param_value = MODEL_SPECS[model_key].get("parameters", 100.0)
|
| 1310 |
+
params.append(float(param_value) if isinstance(param_value, (int, float)) else 100.0)
|
| 1311 |
is_user_model.append(False)
|
| 1312 |
|
| 1313 |
if not models:
|
|
|
|
| 1386 |
|
| 1387 |
return str(output_path)
|
| 1388 |
|
| 1389 |
+
def plot_model_specifications(self, save_path: str | None = None) -> str:
|
| 1390 |
+
"""Create visualization of our model specifications."""
|
| 1391 |
+
if not self.model_specs:
|
| 1392 |
+
logger.warning("No model specifications available for plotting")
|
| 1393 |
+
return ""
|
| 1394 |
+
|
| 1395 |
+
# Filter only successfully analyzed models
|
| 1396 |
+
successful_specs = {k: v for k, v in self.model_specs.items() if v.get("analysis_successful", False)}
|
| 1397 |
+
|
| 1398 |
+
if not successful_specs:
|
| 1399 |
+
logger.warning("No successfully analyzed models for plotting")
|
| 1400 |
+
return ""
|
| 1401 |
+
|
| 1402 |
+
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
|
| 1403 |
+
fig.suptitle("Our Distilled Models - Specifications Analysis", fontsize=16, fontweight="bold")
|
| 1404 |
+
|
| 1405 |
+
# Extract data
|
| 1406 |
+
model_names = list(successful_specs.keys())
|
| 1407 |
+
# Shorten model names for better display
|
| 1408 |
+
display_names = [name.replace("code_model2vec_", "").replace("_", " ") for name in model_names]
|
| 1409 |
+
vocab_sizes = [spec["vocabulary_size"] for spec in successful_specs.values()]
|
| 1410 |
+
param_counts = [spec["parameters_millions"] for spec in successful_specs.values()]
|
| 1411 |
+
embed_dims = [spec["embedding_dimensions"] for spec in successful_specs.values()]
|
| 1412 |
+
disk_sizes = [spec["disk_size_mb"] for spec in successful_specs.values()]
|
| 1413 |
+
|
| 1414 |
+
# 1. Vocabulary Size Comparison
|
| 1415 |
+
axes[0, 0].barh(display_names, vocab_sizes, color="skyblue")
|
| 1416 |
+
axes[0, 0].set_title("Vocabulary Size")
|
| 1417 |
+
axes[0, 0].set_xlabel("Number of Tokens")
|
| 1418 |
+
for i, v in enumerate(vocab_sizes):
|
| 1419 |
+
axes[0, 0].text(v + max(vocab_sizes) * 0.01, i, f"{v:,}", va="center", fontsize=9)
|
| 1420 |
+
|
| 1421 |
+
# 2. Parameter Count Comparison
|
| 1422 |
+
axes[0, 1].barh(display_names, param_counts, color="lightgreen")
|
| 1423 |
+
axes[0, 1].set_title("Model Parameters")
|
| 1424 |
+
axes[0, 1].set_xlabel("Parameters (Millions)")
|
| 1425 |
+
for i, v in enumerate(param_counts):
|
| 1426 |
+
axes[0, 1].text(v + max(param_counts) * 0.01, i, f"{v:.1f}M", va="center", fontsize=9)
|
| 1427 |
+
|
| 1428 |
+
# 3. Embedding Dimensions
|
| 1429 |
+
axes[1, 0].barh(display_names, embed_dims, color="lightsalmon")
|
| 1430 |
+
axes[1, 0].set_title("Embedding Dimensions")
|
| 1431 |
+
axes[1, 0].set_xlabel("Dimensions")
|
| 1432 |
+
for i, v in enumerate(embed_dims):
|
| 1433 |
+
axes[1, 0].text(v + max(embed_dims) * 0.01, i, f"{v}", va="center", fontsize=9)
|
| 1434 |
+
|
| 1435 |
+
# 4. Disk Size
|
| 1436 |
+
axes[1, 1].barh(display_names, disk_sizes, color="plum")
|
| 1437 |
+
axes[1, 1].set_title("Model Size on Disk")
|
| 1438 |
+
axes[1, 1].set_xlabel("Size (MB)")
|
| 1439 |
+
for i, v in enumerate(disk_sizes):
|
| 1440 |
+
axes[1, 1].text(v + max(disk_sizes) * 0.01, i, f"{v:.1f}MB", va="center", fontsize=9)
|
| 1441 |
+
|
| 1442 |
+
plt.tight_layout()
|
| 1443 |
+
|
| 1444 |
+
output_path = save_path or str(self.images_dir / "model_specifications.png")
|
| 1445 |
+
plt.savefig(output_path, dpi=300, bbox_inches="tight")
|
| 1446 |
+
plt.close()
|
| 1447 |
+
|
| 1448 |
+
return output_path
|
| 1449 |
+
|
| 1450 |
def generate_comprehensive_report(self, model_name: str = "Simplified Distillation Models") -> str:
|
| 1451 |
"""Generate comprehensive markdown report for all evaluated models."""
|
| 1452 |
if not self.results:
|
|
|
|
| 1495 |
heatmap_chart = self.plot_language_heatmap()
|
| 1496 |
peer_chart = self.create_peer_comparison_chart(main_model_name)
|
| 1497 |
efficiency_chart = self.create_efficiency_analysis(main_model_name)
|
| 1498 |
+
model_specs_chart = self.plot_model_specifications()
|
| 1499 |
|
| 1500 |
# Generate individual radar charts for all simplified models
|
| 1501 |
individual_radar_charts = self.create_individual_radar_charts(simplified_models)
|
|
|
|
| 1563 |
|
| 1564 |
report += f"| {model_display} | {teacher_display} | {overall_metrics.get('ndcg@10', 0):.4f} | {overall_metrics.get('mrr', 0):.4f} | {overall_metrics.get('recall@5', 0):.4f} | {status} |\n"
|
| 1565 |
|
| 1566 |
+
# Add model specifications section
|
| 1567 |
+
if self.model_specs:
|
| 1568 |
+
successful_specs = {k: v for k, v in self.model_specs.items() if v.get("analysis_successful", False)}
|
| 1569 |
+
if successful_specs:
|
| 1570 |
+
report += f"""
|
| 1571 |
+
|
| 1572 |
+
### 📊 Model Specifications Analysis
|
| 1573 |
+
|
| 1574 |
+
Our distilled models exhibit consistent architectural characteristics across different teacher models:
|
| 1575 |
+
|
| 1576 |
+
| Model | Vocabulary Size | Parameters | Embedding Dim | Disk Size |
|
| 1577 |
+
|-------|----------------|------------|---------------|-----------|
|
| 1578 |
+
"""
|
| 1579 |
+
|
| 1580 |
+
# Sort models by performance for consistency
|
| 1581 |
+
for result in simplified_models_sorted:
|
| 1582 |
+
model_display = result["model_name"]
|
| 1583 |
+
if model_display in successful_specs:
|
| 1584 |
+
spec = successful_specs[model_display]
|
| 1585 |
+
vocab_size = spec["vocabulary_size"]
|
| 1586 |
+
params_m = spec["parameters_millions"]
|
| 1587 |
+
embed_dim = spec["embedding_dimensions"]
|
| 1588 |
+
disk_size = spec["disk_size_mb"]
|
| 1589 |
+
|
| 1590 |
+
report += f"| {model_display.replace('code_model2vec_', '')} | {vocab_size:,} | {params_m:.1f}M | {embed_dim} | {disk_size:.1f}MB |\n"
|
| 1591 |
+
|
| 1592 |
+
if model_specs_chart:
|
| 1593 |
+
report += f"""
|
| 1594 |
+
|
| 1595 |
+

|
| 1596 |
+
|
| 1597 |
+
*Comprehensive analysis of our distilled models showing vocabulary size, parameter count, embedding dimensions, and storage requirements.*
|
| 1598 |
+
|
| 1599 |
+
#### Key Insights from Model Specifications:
|
| 1600 |
+
|
| 1601 |
+
"""
|
| 1602 |
+
# Calculate some insights
|
| 1603 |
+
vocab_sizes = [spec["vocabulary_size"] for spec in successful_specs.values()]
|
| 1604 |
+
param_counts = [spec["parameters_millions"] for spec in successful_specs.values()]
|
| 1605 |
+
embed_dims = [spec["embedding_dimensions"] for spec in successful_specs.values()]
|
| 1606 |
+
disk_sizes = [spec["disk_size_mb"] for spec in successful_specs.values()]
|
| 1607 |
+
|
| 1608 |
+
if vocab_sizes:
|
| 1609 |
+
avg_vocab = sum(vocab_sizes) / len(vocab_sizes)
|
| 1610 |
+
avg_params = sum(param_counts) / len(param_counts)
|
| 1611 |
+
avg_disk = sum(disk_sizes) / len(disk_sizes)
|
| 1612 |
+
|
| 1613 |
+
report += f"""
|
| 1614 |
+
- **Vocabulary Consistency**: All models use vocabulary sizes ranging from {min(vocab_sizes):,} to {max(vocab_sizes):,} tokens (avg: {avg_vocab:,.0f})
|
| 1615 |
+
- **Parameter Efficiency**: Models range from {min(param_counts):.1f}M to {max(param_counts):.1f}M parameters (avg: {avg_params:.1f}M)
|
| 1616 |
+
- **Storage Efficiency**: Disk usage ranges from {min(disk_sizes):.1f}MB to {max(disk_sizes):.1f}MB (avg: {avg_disk:.1f}MB)
|
| 1617 |
+
- **Embedding Dimensions**: Consistent {embed_dims[0]} dimensions across all models (optimized for efficiency)
|
| 1618 |
+
"""
|
| 1619 |
+
|
| 1620 |
report += """
|
| 1621 |
|
| 1622 |
### Key Findings
|
|
|
|
| 1648 |
report += f"\n\n"
|
| 1649 |
report += "*Comparative view showing how the best simplified distillation model performs against top peer models across programming languages.*\n\n"
|
| 1650 |
|
| 1651 |
+
# Add individual radar charts for all simplified models (sorted by performance)
|
| 1652 |
if individual_radar_charts:
|
| 1653 |
report += "### Individual Model Performance by Language\n\n"
|
|
|
|
|
|
|
|
|
|
| 1654 |
|
| 1655 |
+
# Sort the radar charts by model performance (best to worst)
|
| 1656 |
+
for result in simplified_models_sorted:
|
| 1657 |
+
chart_model_name = result["model_name"]
|
| 1658 |
+
if chart_model_name in individual_radar_charts:
|
| 1659 |
+
chart_path = individual_radar_charts[chart_model_name]
|
| 1660 |
+
|
| 1661 |
+
# Extract teacher name for cleaner display
|
| 1662 |
+
teacher_name, teacher_link = get_teacher_model_info(chart_model_name)
|
| 1663 |
+
|
| 1664 |
+
# Use linked teacher name if available
|
| 1665 |
+
teacher_display = f"[{teacher_name}]({teacher_link})" if teacher_link else teacher_name
|
| 1666 |
+
|
| 1667 |
+
# Get performance for display
|
| 1668 |
+
overall_metrics = result.get("overall", {})
|
| 1669 |
+
ndcg_score = overall_metrics.get("ndcg@10", 0)
|
| 1670 |
|
| 1671 |
+
report += f"#### {chart_model_name} (Teacher: {teacher_display}) - NDCG@10: {ndcg_score:.4f}\n\n"
|
| 1672 |
+
report += f"\n\n"
|
| 1673 |
|
| 1674 |
report += f"""
|
| 1675 |
|