---
base_model:
- Qwen/Qwen2.5-7B-Instruct
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
tags:
- data-analysis
- code-generation
- qwen
---

This repository contains the **DataMind-Qwen2.5-7B** model, which was presented in the paper [Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study](https://huggingface.co/papers/2506.19794).

**Paper Abstract:**

Large Language Models (LLMs) hold promise in automating data analysis tasks, yet open-source models face significant limitations in these kinds of reasoning-intensive scenarios. In this work, we investigate strategies to enhance the data analysis capabilities of open-source LLMs. By curating a seed dataset of diverse, realistic scenarios, we evaluate model behavior across three core dimensions: data understanding, code generation, and strategic planning. Our analysis reveals three key findings: (1) Strategic planning quality serves as the primary determinant of model performance; (2) Interaction design and task complexity significantly influence reasoning capabilities; (3) Data quality demonstrates a greater impact than diversity in achieving optimal performance. We leverage these insights to develop a data synthesis methodology, demonstrating significant improvements in open-source LLMs' analytical reasoning capabilities.

For more details, visit the official [DataMind GitHub repository](https://github.com/zjunlp/DataMind).

<h1 align="center"> ✨ DataMind </h1>

## 🔧 Installation

#### 🔩 Manual Environment Configuration

Conda virtual environments offer a light and flexible setup.

**Prerequisites**

- Anaconda installation
- GPU support (recommended CUDA version: 12.4)

**Configuration Steps**

1. Clone the repository:

```bash
git clone https://github.com/zjunlp/DataMind.git
```

2. Enter the working directory; all subsequent commands should be executed in this directory.

```bash
cd DataMind/eval
```

3. Create a virtual environment using `Anaconda`.

```bash
conda create -n DataMind python=3.10
conda activate DataMind
```

4. Install all required Python packages.

```bash
pip install -r requirements.txt
```
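
After installation, you can quickly confirm that PyTorch sees your GPU. This is a minimal check, assuming `requirements.txt` pulls in PyTorch (install it separately otherwise):

```python
import torch

print(torch.cuda.is_available())  # True if a GPU is visible
print(torch.version.cuda)         # CUDA version this torch build targets
print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU only")
```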

## Usage (Text Generation for Data Analysis)

You can use this model with the Hugging Face `transformers` library for text generation, particularly for data analysis and code generation tasks.

First, ensure you have the `transformers` library installed:

```bash
pip install transformers torch
```

Then, you can load and use the model as follows:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "zjunlp/DataMind-Qwen2.5-7B"  # or zjunlp/DataMind-Qwen2.5-14B

# Load the model and tokenizer.
# torch_dtype=torch.bfloat16 improves performance on compatible GPUs;
# device_map="auto" distributes the model across available devices.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Example: generate Python code for data analysis
messages = [
    {"role": "user", "content": "I have a CSV file named 'sales_data.csv' with columns 'Date', 'Product', 'Quantity', 'Price'. Write Python code using pandas to calculate the total revenue for each product and save it to a new CSV file named 'product_revenue.csv'."}
]

# Apply the chat template for Qwen models
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate a response
generated_ids = model.generate(
    **model_inputs,  # passes input_ids and attention_mask
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.8,
    repetition_penalty=1.05,
    eos_token_id=tokenizer.eos_token_id,  # ensure generation stops at the EOS token
)

# Decode only the newly generated tokens (everything after the prompt)
response = tokenizer.decode(
    generated_ids[0][model_inputs.input_ids.shape[1]:],
    skip_special_tokens=True,
)
print(response)
```
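
Alternatively, the high-level `pipeline` API applies the chat template for you. This is a minimal sketch, assuming a recent `transformers` release that accepts chat-style message lists:

```python
from transformers import pipeline
import torch

pipe = pipeline(
    "text-generation",
    model="zjunlp/DataMind-Qwen2.5-7B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "Write pandas code to compute total revenue per product from 'sales_data.csv'."}]
result = pipe(messages, max_new_tokens=512)

# The pipeline returns the full conversation; the last message is the model's reply.
print(result[0]["generated_text"][-1]["content"])
```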

## 🧐 Evaluation

> Note:
>
> - **Ensure** that your working directory is set to the **`eval`** folder in a virtual environment.
> - If you have further questions, feel free to open an issue.
> - If you need to use a local model, deploy it according to the **(optional) `local_model.sh`** script below.

**Step 1: Prepare the parameter configuration**

The evaluation datasets we used are [QRData](https://github.com/xxxiaol/QRData) and [DiscoveryBench](https://github.com/allenai/discoverybench). The script expects the data at `data/QRData/benchmark/data/*.csv` and `data/DiscoveryBench/*.csv`.
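
A quick way to sanity-check that the data landed where the script expects it (a minimal sketch; replace the path with your own data directory):

```python
from pathlib import Path

data_root = Path("/path/to/your/project/DataMind/eval/data")  # adjust to your setup
for sub in ("QRData/benchmark/data", "DiscoveryBench"):
    csvs = list((data_root / sub).glob("*.csv"))
    print(f"{sub}: {len(csvs)} CSV file(s) found")
```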

You can also download our SFT models directly from Hugging Face: [DataMind-Qwen2.5-7B](https://huggingface.co/zjunlp/DataMind-Qwen2.5-7B), [DataMind-Qwen2.5-14B](https://huggingface.co/zjunlp/DataMind-Qwen2.5-14B).

Here is an example:

**`config.yaml`**

```yaml
api_key: your_api_key # API key for models served via an API; not needed for open-source models.
data_root: /path/to/your/project/DataMind/eval/data # Root directory for the data (absolute path).
```
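
For reference, a configuration like this is typically read along the following lines (an illustrative sketch assuming PyYAML is available, not the actual `do_generate.py` code):

```python
import yaml

with open("config.yaml") as f:
    config = yaml.safe_load(f)

api_key = config["api_key"]      # only used for API-served models
data_root = config["data_root"]  # absolute path to the evaluation data
```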

**`run_eval.sh`**

```bash
python do_generate.py \
    --model_name DataMind-Qwen2.5-7B \ # Model name to use.
    --check_model gpt-4o-mini \ # Model used to check the answers.
    --output results \ # Output directory path.
    --dataset_name QRData \ # Dataset name, chosen from QRData and DiscoveryBench.
    --max_round 25 \ # Maximum number of steps.
    --api_port 8000 \ # API port number; required when a local model is used.
    --bidx 0 \ # Begin index (inclusive); `None` means no restriction.
    --eidx None \ # End index (exclusive); `None` means no restriction.
    --temperature 0.0 \ # Temperature for sampling.
    --top_p 1 \ # Top-p for sampling.
    --add_random False # Whether to add random files.
```
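
Note that `bash` does not allow a comment to follow the line-continuation backslash, so the inline `# ...` annotations above are for reference only; remove them (or move each comment to its own line) before executing the script. The same applies to `local_model.sh` below.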

**(Optional) `local_model.sh`**

```bash
CUDA_VISIBLE_DEVICES=$i python -m vllm.entrypoints.openai.api_server \
    --model $MODEL_PATH \ # Local model path.
    --served-model-name $MODEL_NAME \ # The model name you specify.
    --tensor-parallel-size $i \ # Tensor parallel size.
    --port $port # API port number; must match `api_port` above.
```
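
Once the vLLM server is running, a quick request can confirm it is reachable. This is a hypothetical check, assuming the `openai` Python client is installed and that the port and served model name match your deployment:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key
resp = client.chat.completions.create(
    model="DataMind-Qwen2.5-7B",  # must match --served-model-name
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```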

**Step 2: Run the shell script**

**(Optional)** Deploy the local model if needed:

```bash
bash local_model.sh
```

Run the shell script to start the process:

```bash
bash run_eval.sh
```

## ✍️ Citation

If you find our work helpful, please use the following citation.

```bibtex
@article{zhu2025open,
  title={Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study},
  author={Zhu, Yuqi and Zhong, Yi and Zhang, Jintian and Zhang, Ziheng and Qiao, Shuofei and Luo, Yujie and Du, Lun and Zheng, Da and Chen, Huajun and Zhang, Ningyu},
  journal={arXiv preprint arXiv:2506.19794},
  year={2025}
}
```