Improve language tag

#2
by lbourdois - opened
Files changed (1): README.md (+231 -219)
Previous front matter:

---
library_name: transformers
license: mit
base_model:
- Qwen/Qwen2.5-3B-Instruct
language:
- en
---
Updated front matter:

---
library_name: transformers
license: mit
base_model:
- Qwen/Qwen2.5-3B-Instruct
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
---

# Model Card for EQuIP-Queries/EQuIP_3B

An AI model that understands natural language and translates it into accurate Elasticsearch queries. The model is based on the Qwen2.5 3B architecture, a compact yet powerful language model known for its efficiency, and was fine-tuned on 10,000 Elasticsearch query data points to specialize it in generating accurate and relevant queries.

## Model Details

### Model Description

**Our Solution: An AI-Powered Query Generator**

Our team has developed an AI model that understands natural language and translates it into accurate Elasticsearch queries. The model is based on the Qwen2.5 3B architecture, a compact yet powerful language model known for its efficiency, and was fine-tuned on 10,000 Elasticsearch query data points to specialize it in generating accurate and relevant queries.

We employed LoRA (Low-Rank Adaptation) to optimize the model for performance and efficiency: LoRA reduces the number of trainable parameters by introducing low-rank matrices into the Transformer layers. This combination allows us to achieve high accuracy while minimizing computational resource requirements.


**Key Features and Benefits**

- **Natural Language Interface:** Users can simply describe the data they're looking for in plain English, and the model will generate the corresponding Elasticsearch query.
- **Increased Efficiency:** Reduces the time and effort required to write complex queries, allowing users to focus on analyzing their data.
- **Improved Accessibility:** Makes Elasticsearch more accessible to a wider audience, including those who are not experts in its query language.
- **Open Source:** We are committed to open source and believe in the power of community-driven innovation. By making our model open source, we aim to contribute to the advancement of AI and empower others to build upon our work. We recognize the lack of readily available solutions in this specific area, and we're excited to fill that gap.
- **Future Developments:** This is just the beginning. Our team plans to release further updates and enhancements to this model, and we are committed to continuous improvement and innovation in AI-powered search.

- **Developed by:** EQuIP
- **Funded by:** EQuIP
- **Model type:** Causal language model
- **Language(s) (NLP):** English (en)
- **License:** MIT
- **Finetuned from model:** Qwen/Qwen2.5-3B-Instruct

### Model Sources

- **Repository:** https://huggingface.co/EQuIP-Queries/EQuIP_3B

## Uses

### Direct Use

This model is intended to be used directly to translate natural language prompts into Elasticsearch queries, without additional fine-tuning.

Example use cases include:

- Generating Elasticsearch queries from plain-English prompts.
- Simplifying query generation for analysts, developers, or data scientists unfamiliar with Elasticsearch syntax.
- Automating query creation as part of search, analytics, or data exploration tools.

Intended users:

- Developers integrating natural language querying capabilities into Elasticsearch-based applications.
- Analysts and data scientists who frequently interact with Elasticsearch data.

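As an illustration of the intended input/output relationship, a prompt such as "Find me products which are less than $50" should map to a standard Elasticsearch `range` query. The snippet below is a hypothetical example (the field name `price` is an assumption, not taken from the model's training data):

```python
import json

# Hypothetical illustration: the kind of Elasticsearch query the model is
# expected to produce for "Find me products which are less than $50",
# assuming an index with a numeric "price" field.
expected_query = {
    "query": {
        "range": {
            "price": {"lt": 50}
        }
    }
}

print(json.dumps(expected_query, indent=2))
```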
### Out-of-Scope Use

The model is not intended for use cases such as:

- Generating queries for databases or search engines other than Elasticsearch.
- Handling languages other than English.
- Providing factual answers or general conversational interactions.
- Tasks involving sensitive decision-making, such as medical, legal, or financial advice, where inaccurate queries may lead to significant consequences.

## Bias, Risks, and Limitations

**Bias Awareness:**
- The model may inherit biases present in the training data. Users should assess generated outputs for unintended biases or patterns, particularly in sensitive contexts.

**Misuse and Malicious Use:**
- Users must not use the model to intentionally produce harmful or misleading search queries or to manipulate search results.

**Limitations:**
- Performance may degrade significantly if input prompts differ substantially from the fine-tuning data domain.
- The model does not validate query accuracy or safety; generated queries should be reviewed before execution, especially in production environments.

### Recommendations

**Query Validation:**
- Always validate and test generated Elasticsearch queries before deploying them in production or running them on sensitive data. Automatic generation may occasionally produce syntactic or semantic inaccuracies.

**Bias Awareness:**
- The model may inherit biases present in the training data. Users should assess generated outputs for unintended biases or patterns, particularly in sensitive contexts.

**Use in Sensitive Contexts:**
- Avoid using this model for critical or high-stakes decision-making tasks without additional human oversight and validation.

**Continuous Monitoring:**
- Monitor outputs regularly to identify and correct issues promptly, ensuring long-term reliability.

**Transparency:**
- Clearly communicate the AI-driven nature of generated Elasticsearch queries to end users to manage expectations and encourage verification.

## How to Get Started with the Model

Install the required dependencies:

```bash
pip install transformers torch
```

Here's how you can quickly start generating Elasticsearch queries from natural language prompts using this model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EQuIP-Queries/EQuIP_3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

mapping = "[Your Elasticsearch mappings]"
user_request = "Find me products which are less than $50"

prompt = f"Given this mapping: {mapping}\nGenerate an Elasticsearch query for: {user_request}"

# Tokenize the prompt; passing the full encoding forwards the attention mask too.
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=512,      # budget for the generated query, independent of prompt length
    do_sample=True,
    temperature=0.2,         # low temperature keeps generation close to deterministic
    top_p=0.9,
    pad_token_id=tokenizer.pad_token_id,
)

# Decode only the newly generated tokens, skipping the prompt.
generated_query = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print("Generated Elasticsearch query:")
print(generated_query)
```
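Since the model does not validate its own output, a minimal sanity check before sending a generated query to Elasticsearch is to confirm that it parses as JSON. This is a sketch of such a check (the helper name `try_parse_query` is ours, not part of any library):

```python
import json

def try_parse_query(generated_query: str):
    """Return the parsed query dict, or None if the text is not valid JSON."""
    try:
        return json.loads(generated_query)
    except json.JSONDecodeError:
        return None

# A well-formed generated query parses...
assert try_parse_query('{"query": {"match_all": {}}}') is not None
# ...while truncated or malformed output is caught before reaching Elasticsearch.
assert try_parse_query('{"query": {"match_all":') is None
```

Parsing success does not guarantee the query matches the user's intent, so human review is still advised for sensitive data.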

## Training Details

### Training Data

The model was fine-tuned on a custom dataset of 10,000 pairs of natural language prompts and corresponding Elasticsearch queries. Each prompt describes the desired query in plain English and is paired with a manually crafted, accurate Elasticsearch query.

The dataset covers various query types and common Elasticsearch query patterns, including filters, range queries, aggregations, boolean conditions, and text search scenarios.

Currently, the dataset is not publicly available. If it is released in the future, a Dataset Card link will be provided here.

Preprocessing:
- Prompts and queries were cleaned to ensure consistent formatting.
- Special tokens and unnecessary whitespace were removed to ensure high-quality training data.

### Training Procedure

The model was fine-tuned using Low-Rank Adaptation (LoRA) on top of the pre-trained Qwen2.5-3B-Instruct model. LoRA significantly reduces computational requirements by training only low-rank matrices within the Transformer layers.

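To illustrate why LoRA is cheap: for a d×d weight matrix W, LoRA learns an update ΔW = B·A with B of shape d×r and A of shape r×d, so only 2·d·r parameters are trained instead of d². The dimensions below are assumptions for the sake of the example, not the actual LoRA configuration used for this model:

```python
# Illustrative parameter count for a single d x d projection matrix.
# d and r here are assumed values, not this model's actual configuration.
d = 2048   # hidden size of the projection
r = 16     # LoRA rank

full_params = d * d          # training the full matrix
lora_params = 2 * d * r      # training only A (r x d) and B (d x r)

print(f"full: {full_params:,}  lora: {lora_params:,}  "
      f"ratio: {lora_params / full_params:.2%}")
# prints: full: 4,194,304  lora: 65,536  ratio: 1.56%
```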

#### Training Hyperparameters

- **Training regime:** bf16 non-mixed precision

## Evaluation

The model was evaluated on a held-out test set of 1,000 prompt-query pairs not included in the training data. The primary goal of the evaluation was to measure the accuracy and relevance of the generated Elasticsearch queries.

### Testing Data, Factors & Metrics

#### Testing Data

- Size: 1,000 prompt-query pairs (held out from training).
- Composition: Representative of diverse Elasticsearch query types, including boolean conditions, aggregations, text search, and date-based queries.

#### Factors

The evaluation considered:
- Complexity of the Elasticsearch query.
- Accuracy in interpreting the intent of the natural language prompt.
- Syntactic correctness and relevance of the generated query.

#### Metrics

- **Exact Match:** Percentage of generated queries that match the ground-truth query exactly.
- **Semantic Similarity:** Embedding-based similarity between generated and ground-truth queries (e.g., cosine similarity).
- **Token-level F1:** Precision and recall at the token level, measuring partial correctness of generated queries.

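A token-level F1 of the kind described can be computed by treating both queries as bags of tokens. This is a minimal sketch; the actual tokenization and matching rules used in the evaluation are not specified in this card:

```python
from collections import Counter

def token_f1(predicted: str, reference: str) -> float:
    """Token-level F1 between two whitespace-tokenized strings."""
    pred_tokens = Counter(predicted.split())
    ref_tokens = Counter(reference.split())
    # Tokens shared between prediction and reference, counted with multiplicity.
    overlap = sum((pred_tokens & ref_tokens).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_tokens.values())
    recall = overlap / sum(ref_tokens.values())
    return 2 * precision * recall / (precision + recall)

print(token_f1('{ "query": { "match_all": {} } }',
               '{ "query": { "match_all": {} } }'))  # identical queries -> 1.0
```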
### Results

| Model | Parameters | Generation Time (sec) | Token Precision | Token Recall | Token F1 | Validity Rate | Field Similarity |
|----------------------|------------|-----------------------|-----------------|--------------|----------|---------------|------------------|
| **EQuIP** | 3B | 0.7969 | 0.8738 | 0.9737 | 0.9808 | 0.97 | 0.9916 |
| **LLaMA 3.1** | 8B | 13.4822 | 0.3979 | 0.6 | 0.5693 | 0.5723 | 0.4622 |
| **Qwen 2.5** | 7B | 1.4233 | 0.6667 | 0.7 | 0.7743 | 0.82 | 0.6479 |
| **DeepSeek Distill** | 8B | 9.2516 | 0.5846 | 0.65 | 0.6979 | 0.7496 | 0.8908 |
| **Gemma 2** | 9B | 3.0801 | 0.6786 | 0.82 | 0.7309 | 0.8 | 0.8151 |
| **Mistral** | 7B | 2.1068 | 0.6786 | 0.79 | 0.7551 | 0.8 | 0.7437 |

#### Summary

The evaluation shows that the model performs strongly at translating natural language prompts into valid Elasticsearch queries, with particularly high token precision, recall, and semantic similarity. Compared to several larger, widely used models, it offers an excellent balance of accuracy, speed, and computational efficiency, making it well suited for production use in Elasticsearch query generation. Users should nonetheless verify query outputs, especially in critical or sensitive applications.

## Environmental Impact

Carbon emissions for the training and fine-tuning of this model can be estimated using the Machine Learning Impact calculator introduced by Lacoste et al. (2019).

- **Hardware Type:** NVIDIA A100 GPU
- **Hours used:** 11
- **Cloud Provider:** Vast.ai


### Model Architecture and Objective

This model is based on the Qwen2.5-3B-Instruct architecture, a decoder-only, transformer-based causal language model with approximately 3 billion parameters, designed for efficient, high-quality natural language understanding and generation.

The primary objective of this fine-tuned model is to convert natural language prompts into syntactically correct and semantically relevant Elasticsearch queries. To achieve this, the model was fine-tuned on domain-specific data using Low-Rank Adaptation (LoRA) for performance and resource efficiency.

## Model Card Contact

Contact: EQuIP
Email: info@equipqueries.com