---
license: mit
datasets:
- allenai/tulu-v2-sft-mixture
language:
- en
base_model:
- google/gemma-2-2b-it
framework:
- llamafactory
---

# GENOME: LoRA Expert Models

This repository contains 10 expert models fine-tuned via low-rank adaptation (LoRA) on 10 distinct domains extracted from the [Tulu-v2-SFT-mixture](https://huggingface.co/datasets/allenai/tulu-v2-sft-mixture) dataset. Our base model is **google/gemma-2-2b-it**, and all expert models were trained using the llama-factory framework on an 8×A100-80GB GPU setup. Our goal is to contribute to the open-source community by sharing these domain-specific experts.

## Experimental Setup

- **Base Model:** [google/gemma-2-2b-it](https://huggingface.co/google/gemma-2-2b-it)
- **Dataset:** 10 subsets from Tulu-v2-SFT-mixture
- **Fine-tuning Framework:** llama-factory
- **Adaptation Technique:** LoRA
- **Training Hardware:** 8×A100-80GB GPUs
- **Note:** Deploying a 2B model requires only 12 GB of VRAM. For optimal performance, we recommend an RTX 3090/4090 (24 GB) or a comparable GPU.

A visualization of performance ranks across the evaluation datasets shows that each expert model excels in its respective domain.

vLLM supports dynamic LoRA switching, allowing seamless adaptation of different expert models with minimal computational overhead, which makes deployment cost-effective.

## Usage Instructions

Below is an example deployment script that shows how to use vLLM to serve the base model along with the LoRA weights on a single GPU (adapted from the original multi-GPU script). Make sure to adjust the parameters (such as the model path and log directory) to suit your environment.

### Step 1. Deploying the Base Model on a Single GPU (or more)

Save the following script as `deploy_single_gpu.sh` and modify the placeholders accordingly:

```bash
#!/bin/bash
export VLLM_LOGGING_LEVEL=DEBUG
export VLLM_ALLOW_RUNTIME_LORA_UPDATING=True
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1

# Specify your model path here (this can be a local path or a Hugging Face Hub path)
MODEL="input your model path here"
# Set the maximum number of LoRAs
MAX_LORAS=20
# Log directory for vLLM logs
ROOT="input your log dir here"
# Maximum LoRA rank
MAX_LORA_RANK=16
# Specify the port for the API server (single-GPU deployment requires only one port)
PORT=9112

echo "Deploying model $MODEL with $MAX_LORAS LoRAs on a single GPU"
echo "Starting API server on port $PORT..."

# Create the log directory if it doesn't exist
mkdir -p vllm_logs/$ROOT

COMMON_ARGS="--model $MODEL \
    --trust-remote-code \
    --enable-lora \
    --seed 42 \
    --max-lora-rank $MAX_LORA_RANK \
    --gpu-memory-utilization 0.95 \
    --max-loras $MAX_LORAS \
    --max-cpu-loras $MAX_LORAS \
    --disable-sliding-window \
    --max-model-len 8192"

# Single-GPU deployment: use only GPU 0
CUDA_VISIBLE_DEVICES=0 nohup python -m vllm.entrypoints.openai.api_server \
    $COMMON_ARGS \
    --port $PORT > vllm_logs/$ROOT/port_1.log 2>&1 &
```
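
Once the server is up, you can confirm it is responding before loading any adapters. Below is a minimal sanity check, assuming the server runs locally on the port configured above:

```python
import requests

# Query the OpenAI-compatible model list. Right after startup it should
# contain only the base model; adapters loaded later (Step 2) also show up here.
response = requests.get("http://localhost:9112/v1/models")
response.raise_for_status()
for model in response.json()["data"]:
    print(model["id"])
```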

### Step 2. Loading and Unloading LoRA Adapters Dynamically

vLLM exposes `/load_lora_adapter` and `/unload_lora_adapter` endpoints (enabled by `VLLM_ALLOW_RUNTIME_LORA_UPDATING=True` in the deployment script), so expert adapters can be swapped at runtime without restarting the server.

1. Download the LoRA weights and store them under `/lora/*`.
2. Use the following Python code to load and unload LoRA adapters dynamically (a usage sketch follows the code):

```python
import requests
import time

from loguru import logger


def online_load_lora(base_url: str, lora_name: str, lora_path: str):
    """Register a LoRA adapter with the running vLLM server, retrying with backoff."""
    counter = 1
    while True:
        try:
            response = requests.post(
                f"{base_url}/load_lora_adapter",
                json={"lora_name": lora_name, "lora_path": lora_path},
            )
            time.sleep(3)
            assert response.status_code == 200, f"Failed to load LoRA: {response.text}"
            break
        except Exception as e:
            logger.warning(f"Load LoRA error: {e}, retrying in {min(counter, 10)} seconds ...")
            time.sleep(min(counter, 10))
            counter += 1


def online_unload_lora(base_url: str, lora_name: str):
    """Remove a previously loaded LoRA adapter from the server, retrying on failure."""
    while True:
        try:
            response = requests.post(
                f"{base_url}/unload_lora_adapter",
                json={"lora_name": lora_name},
            )
            assert response.status_code == 200, f"Failed to unload LoRA: {response.text}"
            break
        except Exception as e:
            logger.warning(f"Unload LoRA error: {e}, retrying ...")
            time.sleep(1)
```
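
For example, the sketch below fetches an expert's weights and swaps it in and out on the running server. The repository ID, adapter name, and paths are placeholders; substitute the expert you actually want:

```python
from huggingface_hub import snapshot_download

base_url = "http://localhost:9112/v1"

# Download the adapter files locally (placeholder repo ID).
lora_path = snapshot_download(repo_id="your-org/your-lora-expert", local_dir="/lora/example_lora")

# Register the adapter under a name of your choosing.
online_load_lora(base_url, lora_name="example_lora", lora_path=lora_path)

# ... query "example_lora" via the OpenAI-compatible API (see Step 3) ...

# Free the adapter slot before loading a different expert.
online_unload_lora(base_url, lora_name="example_lora")
```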

### Step 3. Using the OpenAI SDK to Access the Deployed LoRA Models

Once a LoRA adapter is loaded, you can interact with it using the OpenAI SDK. Below is a mock example:

```python
import openai


def query_lora_model(base_url: str, lora_name: str, prompt: str) -> str:
    # vLLM's server does not check API keys unless configured to,
    # but the OpenAI client still requires a value.
    client = openai.OpenAI(base_url=base_url, api_key="EMPTY")
    response = client.completions.create(
        model=lora_name,
        prompt=prompt,
        max_tokens=100,
    )
    return response.choices[0].text


# Example usage
base_url = "http://localhost:9112/v1"
lora_name = "example_lora"
prompt = "Tell me about the impact of AI in healthcare."

response_text = query_lora_model(base_url, lora_name, prompt)
print(response_text)
```
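
Since the base model is instruction-tuned, the chat completions endpoint, which applies the model's chat template automatically, may give better results than raw completions. A minimal sketch reusing the placeholder names above:

```python
import openai

client = openai.OpenAI(base_url="http://localhost:9112/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="example_lora",  # the lora_name registered in Step 2
    messages=[{"role": "user", "content": "Tell me about the impact of AI in healthcare."}],
    max_tokens=100,
)
print(response.choices[0].message.content)
```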

## Related Projects

This repository is associated with the [GENOME project](https://github.com/ZhangYiqun018/GENOME). We welcome community feedback and contributions to help further open-source AI development.