---
license: mit
datasets:
- allenai/tulu-v2-sft-mixture
language:
- en
base_model:
- google/gemma-2-2b-it
framework:
- llamafactory
---

# GENOME: LoRA Expert Models

This repository contains 10 expert models fine-tuned via low-rank adaptation (LoRA) on 10 distinct domains extracted from the [Tulu-v2-SFT-mixture](https://huggingface.co/datasets/allenai/tulu-v2-sft-mixture) dataset. Our base model is **google/gemma-2-2b-it**, and all expert models were trained using the llama-factory framework on an 8×A100-80GB GPU setup. Our goal is to contribute to the open-source community by sharing these domain-specific experts.

## Experimental Setup

- **Base Model:** [google/gemma-2-2b-it](https://huggingface.co/google/gemma-2-2b-it)
- **Dataset:** 10 subsets from Tulu-v2-SFT-mixture
- **Fine-tuning Framework:** llama-factory
- **Adaptation Technique:** LoRA
- **Training Hardware:** 8×A100-80GB GPUs
- **Note:** Deploying a 2B model requires only 12 GB of VRAM. For optimal performance, we recommend an RTX 3090/4090 (24 GB) or a comparable GPU.

A visualization of performance ranks across the evaluation datasets shows that each expert model excels in its respective domain.

vLLM supports dynamic LoRA switching, allowing seamless adaptation of different expert models with minimal computational overhead, which makes deployment cost-effective.

## Usage Instructions

Below is an example deployment script that shows how to use vLLM to serve the base model along with the LoRA weights on a single GPU (adapted from the original multi-GPU script). Make sure to adjust the parameters (such as the model path and log directory) to suit your environment.

### Step 1. Deploying the Base Model on a Single GPU (or more)

Save the following script as `deploy_single_gpu.sh` and modify the placeholders accordingly:

```bash
#!/bin/bash
export VLLM_LOGGING_LEVEL=DEBUG
export VLLM_ALLOW_RUNTIME_LORA_UPDATING=True
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1

# Specify your model path here (this can be a local path or a Hugging Face Hub path)
MODEL="input your model path here"
# Set the maximum number of LoRAs
MAX_LORAS=20
# Log directory for vLLM logs
ROOT="input your log dir here"
# Maximum LoRA rank
MAX_LORA_RANK=16
# Specify the port for the API server (single-GPU deployment requires only one port)
PORT=9112

echo "Deploying model $MODEL with $MAX_LORAS LoRAs on a single GPU"
echo "Starting API server on port $PORT..."

# Create the log directory if it doesn't exist
mkdir -p vllm_logs/$ROOT

COMMON_ARGS="--model $MODEL \
    --trust-remote-code \
    --enable-lora \
    --seed 42 \
    --max-lora-rank $MAX_LORA_RANK \
    --gpu-memory-utilization 0.95 \
    --max-loras $MAX_LORAS \
    --max-cpu-loras $MAX_LORAS \
    --disable-sliding-window \
    --max-model-len 8192"

# Single-GPU deployment: use only GPU 0
CUDA_VISIBLE_DEVICES=0 nohup python -m vllm.entrypoints.openai.api_server \
    $COMMON_ARGS \
    --port $PORT > vllm_logs/$ROOT/port_1.log 2>&1 &
```
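
Once the server is up, you can confirm it is responding before loading any adapters. Below is a minimal sanity check, assuming the server runs locally on the port configured above:

```python
import requests

# Query the OpenAI-compatible model list. Right after startup it should
# contain only the base model; adapters loaded later (Step 2) also show up here.
response = requests.get("http://localhost:9112/v1/models")
response.raise_for_status()
for model in response.json()["data"]:
    print(model["id"])
```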

### Step 2. Loading and Unloading LoRA Adapters Dynamically

vLLM exposes `/load_lora_adapter` and `/unload_lora_adapter` endpoints (enabled by `VLLM_ALLOW_RUNTIME_LORA_UPDATING=True` in the deployment script), so expert adapters can be swapped at runtime without restarting the server.

1. Download the LoRA weights and store them under `/lora/*`.
2. Use the following Python code to load and unload LoRA adapters dynamically (a usage sketch follows the code):

```python
import requests
import time

from loguru import logger


def online_load_lora(base_url: str, lora_name: str, lora_path: str):
    """Register a LoRA adapter with the running vLLM server, retrying with backoff."""
    counter = 1
    while True:
        try:
            response = requests.post(
                f"{base_url}/load_lora_adapter",
                json={"lora_name": lora_name, "lora_path": lora_path},
            )
            time.sleep(3)
            assert response.status_code == 200, f"Failed to load LoRA: {response.text}"
            break
        except Exception as e:
            logger.warning(f"Load LoRA error: {e}, retrying in {min(counter, 10)} seconds ...")
            time.sleep(min(counter, 10))
            counter += 1


def online_unload_lora(base_url: str, lora_name: str):
    """Remove a previously loaded LoRA adapter from the server, retrying on failure."""
    while True:
        try:
            response = requests.post(
                f"{base_url}/unload_lora_adapter",
                json={"lora_name": lora_name},
            )
            assert response.status_code == 200, f"Failed to unload LoRA: {response.text}"
            break
        except Exception as e:
            logger.warning(f"Unload LoRA error: {e}, retrying ...")
            time.sleep(1)
```
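
For example, the sketch below fetches an expert's weights and swaps it in and out on the running server. The repository ID, adapter name, and paths are placeholders; substitute the expert you actually want:

```python
from huggingface_hub import snapshot_download

base_url = "http://localhost:9112/v1"

# Download the adapter files locally (placeholder repo ID).
lora_path = snapshot_download(repo_id="your-org/your-lora-expert", local_dir="/lora/example_lora")

# Register the adapter under a name of your choosing.
online_load_lora(base_url, lora_name="example_lora", lora_path=lora_path)

# ... query "example_lora" via the OpenAI-compatible API (see Step 3) ...

# Free the adapter slot before loading a different expert.
online_unload_lora(base_url, lora_name="example_lora")
```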

### Step 3. Using the OpenAI SDK to Access the Deployed LoRA Models

Once a LoRA adapter is loaded, you can interact with it using the OpenAI SDK. Below is a mock example:

```python
import openai


def query_lora_model(base_url: str, lora_name: str, prompt: str) -> str:
    # vLLM's server does not check API keys unless configured to,
    # but the OpenAI client still requires a value.
    client = openai.OpenAI(base_url=base_url, api_key="EMPTY")
    response = client.completions.create(
        model=lora_name,
        prompt=prompt,
        max_tokens=100,
    )
    return response.choices[0].text


# Example usage
base_url = "http://localhost:9112/v1"
lora_name = "example_lora"
prompt = "Tell me about the impact of AI in healthcare."

response_text = query_lora_model(base_url, lora_name, prompt)
print(response_text)
```
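
Since the base model is instruction-tuned, the chat completions endpoint, which applies the model's chat template automatically, may give better results than raw completions. A minimal sketch reusing the placeholder names above:

```python
import openai

client = openai.OpenAI(base_url="http://localhost:9112/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="example_lora",  # the lora_name registered in Step 2
    messages=[{"role": "user", "content": "Tell me about the impact of AI in healthcare."}],
    max_tokens=100,
)
print(response.choices[0].message.content)
```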

## Related Projects

This repository is associated with the [GENOME project](https://github.com/ZhangYiqun018/GENOME). We welcome community feedback and contributions to help further open-source AI development.