---
license: mit
datasets:
  - vicgalle/alpaca-gpt4
language:
  - en
metrics:
  - accuracy
base_model:
  - Qwen/Qwen2.5-7B-Instruct
new_version: Qwen/Qwen2.5-7B-Instruct
library_name: peft
---

# FlowerTune LLM on General NLP/Medical/Finance

This repository performs federated instruction tuning of pretrained language models on the FlowerTune-LLM benchmark, covering a general NLP dataset (vicgalle/alpaca-gpt4), a medical dataset, and a finance dataset. We use Flower Datasets to download, partition, and preprocess the data, and Flower's Simulation Engine to simulate the federated fine-tuning process, which allows the training to run on a single GPU.

## Important Links

## Experimental Setup

The dataset is divided into 20 partitions in an IID fashion, and one partition is assigned to each ClientApp. In each of the 10 rounds, we randomly sample a fraction (0.1) of the total nodes to participate. All settings are defined in `pyproject.toml`.
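The partitioning and per-round sampling described above can be sketched in plain Python (the dataset size and seed are illustrative; the actual setup uses Flower Datasets and Flower's Simulation Engine):

```python
import random

NUM_PARTITIONS = 20   # one IID partition per ClientApp
FRACTION_FIT = 0.1    # fraction of nodes sampled each round
NUM_ROUNDS = 10

# IID partitioning: shuffle example indices, then deal them out evenly.
example_ids = list(range(10_000))  # placeholder for dataset example indices
rng = random.Random(42)
rng.shuffle(example_ids)
partitions = [example_ids[i::NUM_PARTITIONS] for i in range(NUM_PARTITIONS)]

# Per-round client sampling: 0.1 * 20 = 2 clients participate each round.
clients_per_round = max(1, int(FRACTION_FIT * NUM_PARTITIONS))
for rnd in range(NUM_ROUNDS):
    sampled = rng.sample(range(NUM_PARTITIONS), clients_per_round)
    # ...each sampled client would fine-tune on its own partition...
```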

## Methodology

This baseline performs federated LLM fine-tuning with LoRA using the 🤗 PEFT library. The clients' models are aggregated with the FedAvg strategy, providing a baseline performance for the benchmark.
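FedAvg aggregates client updates as a weighted average of their parameters, with each client weighted by its number of local training examples. A minimal sketch in plain Python (function and variable names are illustrative, not Flower's API):

```python
def fedavg(client_params, client_num_examples):
    """Data-size-weighted average of client parameter vectors (FedAvg).

    client_params: one flat list of parameters per client.
    client_num_examples: number of local training examples per client.
    """
    total = sum(client_num_examples)
    aggregated = [0.0] * len(client_params[0])
    for params, n in zip(client_params, client_num_examples):
        weight = n / total
        for i, p in enumerate(params):
            aggregated[i] += weight * p
    return aggregated

# Two clients: the one with more data pulls the average toward its weights.
agg = fedavg([[1.0, 0.0], [0.0, 1.0]], [3, 1])
# → [0.75, 0.25]
```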

### Example: Qwen2.5-7B-Instruct

For example, with the Qwen/Qwen2.5-7B-Instruct model we adopted the following fine-tuning methodology:

- Precision: bf16 for model weights
- Quantization: 4-bit quantization for reduced memory usage
- LoRA configuration:
  - Rank (r): 32
  - Alpha: 64
- Training configuration:
  - Batch size: 8
  - Maximum number of steps: 10
  - Total number of rounds: 10
  - Fraction fit per round: 0.1
- Learning-rate scheduler:
  - Maximum LR: 5e-5
  - Minimum LR: 1e-6
  - Constant learning rate over steps
- Strategy: FedAvg
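With rank r = 32 and alpha = 64 as above, the LoRA scaling factor is alpha/r = 2, and each adapted d_out × d_in linear layer gains r · (d_in + d_out) trainable parameters. A quick back-of-the-envelope sketch (the 4096 × 4096 layer shape is illustrative, not Qwen2.5-7B's actual dimensions):

```python
def lora_trainable_params(d_in, d_out, r):
    # LoRA factorizes the weight update as B @ A,
    # with A of shape (r, d_in) and B of shape (d_out, r).
    return r * (d_in + d_out)

R, ALPHA = 32, 64
scaling = ALPHA / R  # LoRA scales the low-rank update by alpha / r

# Illustrative 4096x4096 projection layer:
added = lora_trainable_params(4096, 4096, R)  # 262,144 adapter params
full = 4096 * 4096                            # 16,777,216 full-layer params
# The adapter trains well under 2% of the layer's parameters.
```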

## Environment and Execution

### Environment Setup

Project dependencies are defined in `pyproject.toml`. Install them in an activated Python environment with:

```shell
python -m pip install --upgrade pip wheel setuptools packaging
pip install -e .
```

### Running the Training and Evaluation

We use a wrapper script, `run_all_experiments.sh`, to handle both the training and evaluation processes:

```shell
# Example of running experiments
./run_all_experiments.sh --model Qwen/Qwen2.5-7B-Instruct --task general_nlp
```

The wrapper script sets up the proper environment, including:

- Activating the conda environment
- Setting up proxy configurations if needed
- Executing the main experiment runner script with the provided parameters

The actual experiment workflow is implemented in `run_experiments.py`, which is called by the wrapper script.

### Model Saving

By default, the global PEFT model checkpoints are saved on the server side every 5 rounds after aggregation; this interval can be configured via `train.save-every-round` under the `[tool.flwr.app.config]` entry in `pyproject.toml`.
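For example, to keep the default interval of 5 rounds, the relevant `pyproject.toml` fragment would look like this (other config keys omitted; the key name is taken from the text above):

```toml
[tool.flwr.app.config]
train.save-every-round = 5
```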

## Evaluation Results

Please find the checkpoint link by clicking each model's name.

### General NLP

The evaluation was conducted on the MMLU (Massive Multitask Language Understanding) benchmark, which tests knowledge across various domains:

| Model | STEM | Social Sciences | Humanities | Average | Comm. Costs |
| --- | --- | --- | --- | --- | --- |
| Qwen/Qwen2.5-7B-Instruct | 52.52% | 79.27% | 60.32% | 64.04% | 1.50 GB |
| Qwen/Qwen2.5-1.5B-Instruct | 47.13% | 62.30% | 50.54% | 53.32% | 0.65 GB |
| mistralai/Mistral-7B-Instruct-v0.3 | 29.94% | 54.27% | 44.93% | 43.05% | 2.03 GB |
| meta-llama/Llama-3.1-8B-Instruct | 22.87% | 39.55% | 32.05% | 31.49% | 2.03 GB |
| TinyLlama/TinyLlama-1.1B-Chat-v1.0 | 14.18% | 21.61% | 21.91% | 19.23% | 0.67 GB |
| meta-llama/Llama-3.2-1B-Instruct | 12.88% | 17.61% | 6.16% | 12.22% | 0.51 GB |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | 0.54% | 0.00% | 0.04% | 0.19% | 0.65 GB |

### Medical

| Model | PubMedQA | MedMCQA | MedQA | CareQA | Average | Comm. Costs |
| --- | --- | --- | --- | --- | --- | --- |
| meta-llama/Llama-3.1-8B-Instruct | 59.94% | 69.40% | 59.94% | 57.74% | 61.75% | 2.03 GB |
| Qwen/Qwen2.5-7B-Instruct | 65.15% | 56.80% | 65.15% | 55.46% | 60.64% | 1.50 GB |
| mistralai/Mistral-7B-Instruct-v0.3 | 54.40% | 55.20% | 54.40% | 49.80% | 53.45% | 2.03 GB |

### Finance

| Model | FPB | FIQA | TFNS | Average | Comm. Costs |
| --- | --- | --- | --- | --- | --- |
| mistralai/Mistral-7B-Instruct-v0.3 | 58.63% | 67.11% | 58.63% | 61.45% | 2.03 GB |

## Hardware Details

For the experiments, we used a GPU-enabled virtual machine with the following specification:

| Component | Specification |
| --- | --- |
| GPU | 1 × GPU with 16+ GB |
| vCPUs | 6 |
| Memory | 16+ GB |