Sarthak committed on
Commit
4a1942c
·
1 Parent(s): ecfceb8

docs: Enhance README.md with detailed project information for gte-Qwen2-7B-instruct-M2V-Distilled, including model optimization benefits, installation instructions, usage examples, and performance results.

  1. README.md +143 -0
README.md CHANGED
@@ -1,3 +1,146 @@
  ---
+ base_model: Alibaba-NLP/gte-Qwen2-7B-instruct
+ library_name: model2vec
  license: apache-2.0
+ license_name: apache-2.0
+ license_link: LICENSE
+ model_name: gte-Qwen2-7B-instruct-M2V-Distilled
+ tags:
+ - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ - transformers
+ - Qwen2
  ---
+
+ # gte-Qwen2-7B-instruct-M2V-Distilled
+
+ This project distills the gte-Qwen2-7B-instruct model with Model2Vec, shrinking it by about 180x and greatly increasing inference speed while retaining 86.56% embedding similarity with the original.
+
+ ## Overview
+
+ [gte-Qwen2-7B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct) is a state-of-the-art embedding model designed for retrieval tasks. While powerful, it can be resource-intensive for production use cases.
+
+ [Model2Vec](https://github.com/MinishLab/model2vec) is a technique for distilling large sentence transformer models into small, fast static embedding models. This project applies Model2Vec to create an optimized version of gte-Qwen2-7B-instruct with the following benefits:
+
+ - **Smaller Size**: Reduces model size by a factor of 180
+ - **Faster Inference**: Up to 15,021x faster inference
+ - **Low Resource Requirements**: Minimal memory footprint and dependencies
+ - **Maintains Performance**: Retains 86.56% embedding similarity with the original model
+
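A static embedding model avoids running a transformer at inference time: each token maps to a precomputed vector, and a text embedding is pooled from those vectors. The toy sketch below illustrates that idea only. The vocabulary, the three-dimensional vectors, and the whitespace tokenizer are hypothetical; the distilled model uses its own tokenizer and 256-dimensional vectors.

```python
from typing import Dict, List

# Hypothetical lookup table: one fixed vector per known token.
TOY_VECTORS: Dict[str, List[float]] = {
    "fast": [1.0, 0.0, 0.0],
    "static": [0.0, 1.0, 0.0],
    "embeddings": [0.0, 0.0, 1.0],
}
DIM = 3


def embed(text: str) -> List[float]:
    """Mean-pool the vectors of known tokens; unknown tokens are skipped."""
    vectors = [TOY_VECTORS[tok] for tok in text.lower().split() if tok in TOY_VECTORS]
    if not vectors:
        # No known tokens: fall back to a zero vector.
        return [0.0] * DIM
    return [sum(column) / len(vectors) for column in zip(*vectors)]


print(embed("fast static embeddings"))
```

Because inference reduces to table lookups and an average, throughput is bounded mostly by tokenization, which is consistent with the large speedup reported for the distilled model.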
+ ## Model Information
+
+ - **Model Name**: gte-Qwen2-7B-instruct-M2V-Distilled
+ - **Original Model**: [gte-Qwen2-7B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct)
+ - **Distillation Method**: [Model2Vec](https://github.com/MinishLab/model2vec)
+ - **Original Dimensions**: 3584
+ - **Distilled Dimensions**: 256
+ - **Embedding Similarity**: 86.56% similarity with the original model
+ - **Size Reduction**: 180x (from 28.7GB to 158.98MB)
+ - **Speed Improvement**: 15,021x faster (0.50 → 7,549 texts/second)
+
+ ## Installation
+
+ Install the required dependencies with uv:
+
+ ```bash
+ # Install the project dependencies
+ uv sync
+ ```
+
+ ## Usage
+
+ ### Distillation
+
+ To create a distilled version of Alibaba-NLP/gte-Qwen2-7B-instruct:
+
+ ```bash
+ uv run python distill.py
+ ```
+
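The contents of `distill.py` are not shown here. As a rough sketch of what a Model2Vec distillation call can look like, the function below uses the `distill` entry point from the Model2Vec package with the 256 output dimensions listed above. The function name `distill_to_static` and the output directory are hypothetical, and the actual script may pass additional options.

```python
def distill_to_static(
    model_name: str = "Alibaba-NLP/gte-Qwen2-7B-instruct",
    pca_dims: int = 256,
    output_dir: str = "gte-Qwen2-7B-instruct-M2V-Distilled",  # hypothetical path
):
    """Distill a sentence transformer into a static model and save it."""
    # Imported lazily so this module can be loaded without the
    # distillation dependencies (model2vec[distill]) installed.
    from model2vec.distill import distill

    static_model = distill(model_name=model_name, pca_dims=pca_dims)
    static_model.save_pretrained(output_dir)
    return static_model
```

Calling `distill_to_static()` downloads the full 7B base model, so it needs enough disk space and memory to hold the original weights.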
+ ### Evaluation
+
+ To evaluate the distilled model against the original:
+
+ ```bash
+ uv run python evaluate.py
+ ```
+
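Throughput figures such as texts per second can be measured by timing an encode call over a fixed batch. The sketch below is one simple way to do that and is not taken from `evaluate.py`. The helper `texts_per_second` and the stand-in `dummy_encode` are hypothetical; in practice `encode` would be the encode method of the model under test.

```python
import time
from typing import Callable, List, Sequence


def texts_per_second(
    encode: Callable[[Sequence[str]], object],
    texts: List[str],
    repeats: int = 3,
) -> float:
    """Return the best observed throughput over several timed runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        encode(texts)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero elapsed time on very fast calls.
    return len(texts) / max(best, 1e-9)


def dummy_encode(texts: Sequence[str]) -> List[List[float]]:
    """Hypothetical stand-in for a real model's encode method."""
    return [[float(len(t))] for t in texts]


rate = texts_per_second(dummy_encode, ["example text"] * 1000)
print(f"{rate:,.0f} texts/second")
```

Taking the best of several runs reduces the influence of one-off warm-up costs; reporting the mean instead would be an equally reasonable choice.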
+ ### Training Code Classification
+
+ To train a programming language classifier using the distilled model on the CodeSearchNet dataset:
+
+ ```bash
+ uv run python train_code_classification.py
+ ```
+
+ This script:
+ - Uses the [CodeSearchNet dataset](https://github.com/github/CodeSearchNet) for training
+ - Trains a classifier to distinguish between six programming languages: Python, Java, JavaScript, Go, PHP, and Ruby
+ - Creates a `StaticModelForClassification` using the distilled model
+ - Evaluates the classifier and saves the trained model
+
+ **Dataset Details:**
+ - **Source**: `code-search-net/code_search_net` from Hugging Face
+ - **Task**: Programming language classification
+ - **Languages**: Python, Java, JavaScript, Go, PHP, Ruby
+ - **Max samples per language**: 5,000 (for balanced training)
+ - **Code length range**: 50-2,000 characters
+ - **Features**: Function code strings with language labels
+
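The length filter and per-language cap described above could be applied along these lines. This is a sketch, not the code from `train_code_classification.py`: the function name `balance`, the inclusive treatment of the 50 and 2,000 character bounds, and the seeded shuffle are all assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

MIN_CHARS, MAX_CHARS = 50, 2_000  # bounds assumed to be inclusive
MAX_PER_LANGUAGE = 5_000


def balance(samples: Iterable[Tuple[str, str]], seed: int = 0) -> List[Tuple[str, str]]:
    """Keep (code, language) pairs within the length range, capped per language."""
    by_language: Dict[str, List[str]] = defaultdict(list)
    for code, language in samples:
        if MIN_CHARS <= len(code) <= MAX_CHARS:
            by_language[language].append(code)

    rng = random.Random(seed)
    balanced: List[Tuple[str, str]] = []
    for language, codes in by_language.items():
        # Shuffle before truncating so the cap does not favor early samples.
        rng.shuffle(codes)
        balanced.extend((code, language) for code in codes[:MAX_PER_LANGUAGE])
    return balanced


toy = [("x" * 10, "python"), ("x" * 100, "python"), ("y" * 3000, "go"), ("z" * 50, "go")]
print(len(balance(toy)))
```

Capping each language at the same maximum keeps frequent languages from dominating the training set, which is the stated purpose of the 5,000-sample limit.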
+ **Training Configuration:**
+ - **Max epochs**: 30 with early stopping (patience: 5)
+ - **Batch size**: 32
+ - **Learning rate**: 1e-3
+ - **Output**: Scikit-learn compatible pipeline saved to the root directory
+
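Early stopping with a patience of 5 means training halts once the monitored metric has failed to improve for five consecutive epochs, even if the 30-epoch limit has not been reached. The sketch below shows that control flow in isolation. It assumes the monitored metric is a validation loss, which the configuration above does not state, and `run_epoch` is a hypothetical callback that trains one epoch and returns that loss.

```python
from typing import Callable, List, Tuple


def train_with_early_stopping(
    run_epoch: Callable[[int], float],
    max_epochs: int = 30,
    patience: int = 5,
) -> Tuple[int, float]:
    """Run epochs until the loss stops improving; return (best_epoch, best_loss)."""
    best_loss = float("inf")
    best_epoch = -1
    stale = 0  # consecutive epochs without improvement
    for epoch in range(max_epochs):
        loss = run_epoch(epoch)
        if loss < best_loss:
            best_loss, best_epoch, stale = loss, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_epoch, best_loss


# Hypothetical loss curve: improves for four epochs, then plateaus.
losses = [1.0, 0.8, 0.6, 0.5] + [0.55] * 26
calls: List[int] = []


def fake_epoch(epoch: int) -> float:
    calls.append(epoch)
    return losses[epoch]


best_epoch, best_loss = train_with_early_stopping(fake_epoch)
print(best_epoch, best_loss, len(calls))
```

With this toy curve, training stops after nine epochs rather than thirty, because epochs 4 through 8 bring no improvement over epoch 3.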
+ ## Results
+
+ Compared with the original model, the distilled model achieves:
+
+ - **180x reduction in model size** (from 28.7GB to 158.98MB)
+ - **15,021x increase in inference speed** (0.50 → 7,549 texts/second)
+ - **86.56% embedding similarity** maintained with the original model
+ - **14x dimensional reduction** (3584 → 256 dimensions)
+ - **Significant memory efficiency** with minimal resource requirements
+
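The ratios above can be recomputed from the rounded figures as a quick sanity check. The snippet assumes decimal units (1 GB = 1000 MB), which is a guess about how the sizes were reported.

```python
original_size_gb, distilled_size_mb = 28.7, 158.98
original_rate, distilled_rate = 0.50, 7_549  # texts/second
original_dims, distilled_dims = 3584, 256

size_ratio = original_size_gb * 1000 / distilled_size_mb  # assumes 1 GB = 1000 MB
speed_ratio = distilled_rate / original_rate
dim_ratio = original_dims / distilled_dims

print(f"size: {size_ratio:.1f}x, speed: {speed_ratio:,.0f}x, dims: {dim_ratio:.0f}x")
```

The size and dimension ratios match the reported 180x and 14x. The rounded throughput values give roughly 15,098x rather than 15,021x; the reported figure was presumably computed from unrounded measurements.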
+ ### Performance Visualizations
+
+ #### Model Size Comparison
+ ![Model Size Comparison](evaluation/size_comparison.png)
+ *Dramatic reduction in model size from 28.7GB to 158.98MB*
+
+ #### Inference Speed Comparison
+ ![Speed Comparison](evaluation/speed_comparison.png)
+ *15,021x faster inference speed: from 0.50 to 7,549 texts per second*
+
+ #### Memory Usage Comparison
+ ![Memory Comparison](evaluation/memory_comparison.png)
+ *Significant reduction in memory footprint during inference*
+
+ #### Embedding Similarity Analysis
+ ![Similarity Matrix](evaluation/similarity_matrix.png)
+ *86.56% similarity between original and distilled model embeddings*
+
+ Detailed evaluation results, including similarity plots and performance metrics, are saved to the `evaluation/` directory.
+
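How the 86.56% similarity figure was computed is not described here. Because the two models produce vectors of different sizes (3584 vs. 256), their embeddings cannot be compared directly with a cosine. One common alternative, sketched below purely as an assumption, is to compare the pairwise similarity structure each model assigns to the same set of texts. All function names and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pairwise_similarities(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Cosine similarity for every unordered pair of texts."""
    n = len(embeddings)
    return [cosine(embeddings[i], embeddings[j]) for i in range(n) for j in range(i + 1, n)]


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = math.sqrt(var_x * var_y)
    return cov / denom if denom else 0.0


def structure_agreement(original, distilled) -> float:
    """Correlate the two models' pairwise similarity values for the same texts."""
    return pearson(pairwise_similarities(original), pairwise_similarities(distilled))


# Toy embeddings of the same three texts, with different dimensionality.
original = [[1.0, 0.0, 0.0, 0.0], [0.9, 0.1, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
distilled = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]
score = structure_agreement(original, distilled)
print(f"{score:.3f}")
```

A score near 1 means the distilled model ranks text pairs by similarity in much the same way as the original, regardless of the difference in embedding size.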
+ ## Project Structure
+
+ - `distill.py` - Script to create the distilled model
+ - `evaluate.py` - Script to compare performance with the original model
+ - `train_code_classification.py` - Script to train programming language classifier
+ - `MTEB_evaluate.py` - Script to evaluate model on MTEB benchmark tasks
+ - `evaluation/` - Directory containing evaluation results and visualizations
+ - `trained_code_classifier/` - Directory containing trained classification model
+ - `mteb_results/` - Directory containing MTEB evaluation results
+
+ ## Acknowledgments
+
+ This project is built upon the following technologies:
+
+ - [gte-Qwen2-7B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct) - The original embedding model developed by Alibaba-NLP
+ - [Model2Vec](https://github.com/MinishLab/model2vec) - The distillation technique used to optimize the model
+
+ ## License
+
+ This model is licensed under the [Apache 2.0](LICENSE) license, the same as the original gte-Qwen2-7B-instruct model.