Commit 4a1942c by Sarthak (parent: ecfceb8)

docs: Enhance README.md with detailed project information for gte-Qwen2-7B-instruct-M2V-Distilled, including model optimization benefits, installation instructions, usage examples, and performance results.

README.md (contents after this commit):
---
base_model: Alibaba-NLP/gte-Qwen2-7B-instruct
library_name: model2vec
license: apache-2.0
license_name: apache-2.0
license_link: LICENSE
model_name: gte-Qwen2-7B-instruct-M2V-Distilled
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- transformers
- Qwen2
---

# gte-Qwen2-7B-instruct-M2V-Distilled

This project optimizes the gte-Qwen2-7B-instruct model using Model2Vec, reducing its size and dramatically improving inference speed while retaining most of its embedding quality.

## Overview

[gte-Qwen2-7B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct) is a state-of-the-art embedding model designed for retrieval tasks. While powerful, it can be resource-intensive for production use cases.

[Model2Vec](https://github.com/MinishLab/model2vec) is a technique for distilling large sentence transformer models into small, fast static embedding models. This project applies Model2Vec to create an optimized version of gte-Qwen2-7B-instruct with the following benefits:

- **Smaller Size**: Reduces model size by a factor of 180
- **Faster Inference**: Up to 15,021x faster inference
- **Low Resource Requirements**: Minimal memory footprint and dependencies
- **Maintains Performance**: Keeps 86.56% embedding similarity with the original model

## Model Information

- **Model Name**: gte-Qwen2-7B-instruct-M2V-Distilled
- **Original Model**: [gte-Qwen2-7B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct)
- **Distillation Method**: [Model2Vec](https://github.com/MinishLab/model2vec)
- **Original Dimensions**: 3584
- **Distilled Dimensions**: 256
- **Embedding Similarity**: 86.56% maintained with the original model
- **Size Reduction**: 180x (from 28.7 GB to 158.98 MB)
- **Speed Improvement**: 15,021x faster (0.50 → 7,549 texts/second)
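
As a quick usage reference, here is a minimal sketch of loading the distilled model with the `model2vec` library and encoding a few texts (the model identifier below is an assumption; substitute the actual Hub repo id or a local path):

```python
from model2vec import StaticModel

# Load the distilled static embedding model (identifier assumed for illustration).
model = StaticModel.from_pretrained("gte-Qwen2-7B-instruct-M2V-Distilled")

# Encode a batch of texts into 256-dimensional static embeddings.
embeddings = model.encode([
    "What is the capital of France?",
    "Model2Vec distills large sentence transformers into tiny static models.",
])
print(embeddings.shape)  # (2, 256)
```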

## Installation

First, ensure you have the required dependencies:

```bash
# Install the base package
uv sync
```

## Usage

### Distillation

To create a distilled version of Alibaba-NLP/gte-Qwen2-7B-instruct:

```bash
uv run python distill.py
```
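
For reference, `distill.py` presumably wraps Model2Vec's distillation entry point; a minimal sketch along those lines (a simplified assumption, not necessarily the script's actual settings) is:

```python
from model2vec.distill import distill

# Distill the original transformer into a static embedding model.
# pca_dims=256 matches the distilled dimensionality reported above.
m2v_model = distill(
    model_name="Alibaba-NLP/gte-Qwen2-7B-instruct",
    pca_dims=256,
)

# Save the distilled model to a local directory (name chosen for illustration).
m2v_model.save_pretrained("gte-Qwen2-7B-instruct-M2V-Distilled")
```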

### Evaluation

To evaluate the distilled model against the original:

```bash
uv run python evaluate.py
```
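
The similarity figures reported in this README come from comparing how the two models embed the same texts. Below is a minimal sketch of that kind of comparison (not the actual `evaluate.py`; it assumes `sentence-transformers` is installed for the original model and uses an assumed path for the distilled one):

```python
import numpy as np
from model2vec import StaticModel
from sentence_transformers import SentenceTransformer

texts = [
    "How do I reset my password?",
    "What is the capital of France?",
    "Paris is the capital of France.",
    "def add(a, b): return a + b",
]

original = SentenceTransformer("Alibaba-NLP/gte-Qwen2-7B-instruct")              # 3584-dim embeddings
distilled = StaticModel.from_pretrained("gte-Qwen2-7B-instruct-M2V-Distilled")   # 256-dim embeddings

orig_emb = original.encode(texts)
dist_emb = distilled.encode(texts)

def cosine_matrix(x: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarities between rows of x."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    return x @ x.T

# The dimensionalities differ, so compare similarity structure rather than raw vectors.
corr = np.corrcoef(cosine_matrix(orig_emb).ravel(), cosine_matrix(dist_emb).ravel())[0, 1]
print(f"Similarity-structure correlation: {corr:.4f}")
```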

### Training Code Classification

To train a programming language classifier using the distilled model on the CodeSearchNet dataset:

```bash
uv run python train_code_classification.py
```

This script:
- Uses the [CodeSearchNet dataset](https://github.com/github/CodeSearchNet) for training
- Trains a classifier to distinguish between 6 programming languages: Python, Java, JavaScript, Go, PHP, and Ruby
- Creates a `StaticModelForClassification` using the distilled model
- Evaluates the classifier and saves the trained model

**Dataset Details:**
- **Source**: `code-search-net/code_search_net` on the Hugging Face Hub
- **Task**: Programming language classification
- **Languages**: Python, Java, JavaScript, Go, PHP, Ruby
- **Max samples per language**: 5,000 (for balanced training)
- **Code length range**: 50-2,000 characters
- **Features**: Function code strings with language labels

**Training Configuration:**
- **Max epochs**: 30 with early stopping (patience: 5)
- **Batch size**: 32
- **Learning rate**: 1e-3
- **Output**: Scikit-learn compatible pipeline saved to the repository root
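
As a rough illustration of this setup (a sketch only; the actual `train_code_classification.py` handles CodeSearchNet loading, filtering, and evaluation as described above), Model2Vec's training API can be used along these lines:

```python
from model2vec.train import StaticModelForClassification

# Toy stand-in data; the real script builds X (function code strings) and
# y (language labels) from the code-search-net/code_search_net dataset.
X = [
    "def add(a, b):\n    return a + b",
    "def greet(name):\n    return f'hello {name}'",
    "function add(a, b) { return a + b; }",
    "function greet(name) { return `hello ${name}`; }",
]
y = ["python", "python", "javascript", "javascript"]

# Put a classification head on top of the distilled static embeddings
# (the model identifier is assumed; point it at the actual distilled model).
classifier = StaticModelForClassification.from_pretrained(
    model_name="gte-Qwen2-7B-instruct-M2V-Distilled"
)

# Train the classifier; the real script uses the epochs, batch size, and
# learning rate listed above (exact argument names omitted in this sketch).
classifier.fit(X, y)

# Export as a scikit-learn compatible pipeline and save it.
pipeline = classifier.to_pipeline()
pipeline.save_pretrained("trained_code_classifier")
```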

## Results

The distilled model achieves substantial performance improvements:

- **180x reduction in model size** (from 28.7 GB to 158.98 MB)
- **15,021x increase in inference speed** (0.50 → 7,549 texts/second)
- **86.56% embedding similarity** maintained with the original model
- **14x dimensionality reduction** (3584 → 256 dimensions)
- **Significant memory efficiency** with minimal resource requirements

### Performance Visualizations

#### Model Size Comparison
*Dramatic reduction in model size from 28.7 GB to 158.98 MB*

#### Inference Speed Comparison
*15,021x faster inference speed: from 0.50 to 7,549 texts per second*

#### Memory Usage Comparison
*Significant reduction in memory footprint during inference*

#### Embedding Similarity Analysis
*High correlation (86.56%) between original and distilled model embeddings*

Detailed evaluation results, including similarity plots and performance metrics, are saved to the `evaluation/` directory.

## Project Structure

- `distill.py` - Script to create the distilled model
- `evaluate.py` - Script to compare performance with the original model
- `train_code_classification.py` - Script to train the programming language classifier
- `MTEB_evaluate.py` - Script to evaluate the model on MTEB benchmark tasks (see the sketch after this list)
- `evaluation/` - Directory containing evaluation results and visualizations
- `trained_code_classifier/` - Directory containing the trained classification model
- `mteb_results/` - Directory containing MTEB evaluation results
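
`MTEB_evaluate.py` presumably drives the `mteb` package; a minimal sketch of such a run (assuming a recent `mteb` release, an illustrative task name, and an assumed model path) could look like:

```python
import mteb
from model2vec import StaticModel

# Load the distilled static model (path assumed for illustration).
model = StaticModel.from_pretrained("gte-Qwen2-7B-instruct-M2V-Distilled")

# Select one or more MTEB tasks; "STSBenchmark" is just an example.
tasks = mteb.get_tasks(tasks=["STSBenchmark"])
evaluation = mteb.MTEB(tasks=tasks)

# Run the benchmark and write results to the project's results directory.
evaluation.run(model, output_folder="mteb_results")
```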

## Acknowledgments

This project is built upon the following technologies:

- [gte-Qwen2-7B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct) - The original embedding model developed by Alibaba-NLP
- [Model2Vec](https://github.com/MinishLab/model2vec) - The distillation technique used to optimize the model

## License

This model is licensed under the [Apache 2.0](LICENSE) license, the same as the original gte-Qwen2-7B-instruct model.