Improve model card: Add pipeline tag, library, correct license, paper abstract, and usage example
This PR significantly enhances the model card for DataMind-Qwen2.5-7B by:
- **Updating metadata:**
  - Changing the `license` from `mit` to `apache-2.0`, as indicated in the official GitHub repository.
  - Adding `pipeline_tag: text-generation` to ensure proper categorization and discoverability for this data analysis and code generation model.
  - Adding `library_name: transformers` to enable the "Use in Transformers" widget, providing easy access to inference code.
  - Adding relevant `tags` such as `data-analysis`, `code-generation`, and `qwen`.
- **Enriching content:**
  - Adding the paper title and its Hugging Face link for quick reference.
  - Including the paper abstract to provide a comprehensive overview of the model's research context and findings.
  - Adding a direct link to the GitHub repository.
  - Adding a "Usage" section with a practical Python code example for text generation (specifically for data analysis queries) using the `transformers` library.

These improvements make the model card more informative, discoverable, and user-friendly on the Hugging Face Hub.
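Once merged, the metadata changes above can be sanity-checked by parsing the card's YAML front matter. A minimal sketch (the `card_text` excerpt below is a hypothetical reproduction of the updated front matter, and `yaml` refers to the third-party PyYAML package):

```python
import yaml

# Hypothetical excerpt of the model card front matter after this PR
card_text = """---
base_model:
- Qwen/Qwen2.5-7B-Instruct
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
tags:
- data-analysis
- code-generation
- qwen
---
"""

# The front matter is the YAML block between the two `---` markers
front_matter = card_text.split("---")[1]
meta = yaml.safe_load(front_matter)

print(meta["license"])       # apache-2.0
print(meta["pipeline_tag"])  # text-generation
print(meta["tags"])          # ['data-analysis', 'code-generation', 'qwen']
```

The same keys can also be read directly from the Hub with `huggingface_hub.ModelCard.load(...)`, which exposes them as `card.data.license`, `card.data.pipeline_tag`, and so on.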
````diff
@@ -1,12 +1,23 @@
 ---
-license: mit
 base_model:
 - Qwen/Qwen2.5-7B-Instruct
+license: apache-2.0
+pipeline_tag: text-generation
+library_name: transformers
+tags:
+- data-analysis
+- code-generation
+- qwen
 ---
 
+This repository contains the **DataMind-Qwen2.5-7B** model, which was presented in the paper [Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study](https://huggingface.co/papers/2506.19794).
 
-
+**Paper Abstract:**
+Large Language Models (LLMs) hold promise in automating data analysis tasks, yet open-source models face significant limitations in these kinds of reasoning-intensive scenarios. In this work, we investigate strategies to enhance the data analysis capabilities of open-source LLMs. By curating a seed dataset of diverse, realistic scenarios, we evaluate model behavior across three core dimensions: data understanding, code generation, and strategic planning. Our analysis reveals three key findings: (1) Strategic planning quality serves as the primary determinant of model performance; (2) Interaction design and task complexity significantly influence reasoning capabilities; (3) Data quality demonstrates a greater impact than diversity in achieving optimal performance. We leverage these insights to develop a data synthesis methodology, demonstrating significant improvements in open-source LLMs' analytical reasoning capabilities.
+
+For more details, visit the official [DataMind GitHub repository](https://github.com/zjunlp/DataMind).
 
+<h1 align="center"> ✨ DataMind </h1>
 
 ## 🔧 Installation
 
@@ -46,7 +57,64 @@ conda activate DataMind
 pip install -r requirements.txt
 ```
 
+## Usage (Text Generation for Data Analysis)
+
+You can use this model with the Hugging Face `transformers` library for text generation, particularly for data analysis and code generation tasks.
 
+First, ensure you have the `transformers` library installed:
+
+```bash
+pip install transformers torch
+```
+
+Then, you can load and use the model as follows:
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+import torch
+
+model_name = "zjunlp/DataMind-Qwen2.5-7B"  # Or zjunlp/DataMind-Qwen2.5-14B, if available
+
+# Load the model and tokenizer
+# Use torch_dtype=torch.bfloat16 for better performance on compatible GPUs
+# Use device_map="auto" to automatically distribute the model across available devices
+model = AutoModelForCausalLM.from_pretrained(
+    model_name,
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
+    trust_remote_code=True,
+)
+tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+
+# Example: Generate Python code for data analysis
+messages = [
+    {"role": "user", "content": "I have a CSV file named 'sales_data.csv' with columns 'Date', 'Product', 'Quantity', 'Price'. Write Python code using pandas to calculate the total revenue for each product and save it to a new CSV file named 'product_revenue.csv'."}
+]
+
+# Apply chat template for Qwen models
+text = tokenizer.apply_chat_template(
+    messages,
+    tokenize=False,
+    add_generation_prompt=True
+)
+
+model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+
+# Generate response
+generated_ids = model.generate(
+    model_inputs.input_ids,
+    max_new_tokens=512,
+    do_sample=True,
+    temperature=0.7,
+    top_p=0.8,
+    repetition_penalty=1.05,
+    eos_token_id=tokenizer.eos_token_id,  # Ensure generation stops at EOS token
+)
+
+# Decode only the newly generated tokens (strip the prompt)
+response = tokenizer.decode(generated_ids[0][len(model_inputs.input_ids[0]):], skip_special_tokens=True)
+print(response)
+```
 
 ## 🧐 Evaluation
 
@@ -58,9 +126,9 @@ pip install -r requirements.txt
 
 **Step 1: Prepare the parameter configuration**
 
-The evaluation datasets we used are in [QRData](https://github.com/xxxiaol/QRData) and [DiscoveryBench](https://github.com/allenai/discoverybench).
+The evaluation datasets we used are in [QRData](https://github.com/xxxiaol/QRData) and [DiscoveryBench](https://github.com/allenai/discoverybench). The script expects data to be at `data/QRData/benchmark/data/*.csv` and `data/DiscoveryBench/*.csv`.
 
-
+You can also download our sft models directly from Hugging Face: [DataMind-Qwen2.5-7B](https://huggingface.co/zjunlp/DataMind-Qwen2.5-7B), [DataMind-Qwen2.5-14B](https://huggingface.co/zjunlp/DataMind-Qwen2.5-14B).
 
 Here is the example:
 **`config.yaml`**
@@ -111,10 +179,6 @@ Run the shell script to start the process.
 bash run_eval.sh
 ```
 
-
-
-
-
 ## ✍️ Citation
 
 If you find our work helpful, please use the following citations.
````