|
|
---
license: apache-2.0
language:
- en
tags:
- moe
- olmo
- olmoe
co2_eq_emissions: 1
datasets:
- allenai/OLMoE-mix-0924
library_name: transformers
---
|
|
|
|
|
|
|
|
# OLMoE with Adapters
|
|
|
|
|
This repository contains an extension of the OLMoE model with adapter layers for parameter-efficient fine-tuning. By adding small adapter modules to the model, we can fine-tune it on downstream tasks while keeping most of the original parameters frozen, which makes training substantially cheaper in memory and compute.
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
The `OlmoEWithAdaptersForCausalLM` model extends the original OLMoE architecture by:
|
|
|
|
|
1. Adding small adapter layers (bottleneck layers) to each MLP block |
|
|
2. Allowing selective freezing of the base model's parameters |
|
|
3. Training only the adapter parameters (~0.1-1% of total parameters) |
|
|
|
|
|
Key components: |
|
|
- `OlmoEWithAdaptersMLP`: MLP layer with additional adapter modules (see the sketch below)
|
|
- `OlmoEWithAdaptersDecoderLayer`: Decoder layer incorporating adapter MLPs |
|
|
- `OlmoEWithAdaptersModel`: Full model with adapter-based decoder layers |
|
|
- `OlmoEWithAdaptersForCausalLM`: Causal language model with adapters |
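
As a rough illustration of the bottleneck design described above, the sketch below shows what such an adapter module might look like. This is a hypothetical sketch, not the actual code in `modeling_olmoe.py`; the `Adapter` class name and the zero-initialized up-projection are assumptions.

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual add."""

    def __init__(self, hidden_size: int, adapter_size: int = 64):
        super().__init__()
        self.down_proj = nn.Linear(hidden_size, adapter_size)
        self.activation = nn.GELU()
        self.up_proj = nn.Linear(adapter_size, hidden_size)
        # Zero-init the up-projection so the adapter starts as an identity mapping
        # and fine-tuning begins from the pretrained model's behavior.
        nn.init.zeros_(self.up_proj.weight)
        nn.init.zeros_(self.up_proj.bias)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.up_proj(self.activation(self.down_proj(hidden_states)))
```

In `OlmoEWithAdaptersMLP`, a module along these lines would be applied to the MLP output before it is passed back to the decoder layer.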
|
|
|
|
|
## Training Script |
|
|
|
|
|
The `train_olmoe_adapters.py` script provides a complete workflow for fine-tuning the model: |
|
|
|
|
|
### Features: |
|
|
- Parameter-efficient fine-tuning using adapters |
|
|
- Support for various datasets through Hugging Face datasets library |
|
|
- Customizable adapter size |
|
|
- Option to freeze/unfreeze different components (see the sketch after this list)
|
|
- Training with AdamW optimizer and learning rate scheduling |
|
|
- Evaluation with perplexity metrics |
|
|
- Model checkpointing and saving |
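
A minimal sketch of the freeze/unfreeze logic, assuming adapter parameter names contain the substring `"adapter"`; the actual flag handling lives in `train_olmoe_adapters.py` and this helper is illustrative only:

```python
def freeze_base_model(model):
    """Freeze every parameter except the adapter modules (illustrative helper)."""
    for name, param in model.named_parameters():
        # Assumes adapter parameters include "adapter" in their name.
        param.requires_grad = "adapter" in name

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Trainable parameters: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
```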
|
|
|
|
|
### Usage: |
|
|
|
|
|
```bash
python train_olmoe_adapters.py \
    --model_name_or_path allenai/OLMo-7B \
    --adapter_size 64 \
    --freeze_base_model True \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --output_dir ./olmoe-adapter-finetuned \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --learning_rate 5e-5 \
    --warmup_steps 100 \
    --logging_steps 100 \
    --save_steps 1000 \
    --seed 42
```
|
|
|
|
|
## Benefits of Adapter-Based Fine-Tuning |
|
|
|
|
|
1. **Efficiency**: Train only ~0.1-1% of the parameters, dramatically reducing GPU memory requirements |
|
|
2. **Storage**: Store only adapter weights rather than full fine-tuned models |
|
|
3. **Composability**: Multiple adapters can be trained for different tasks and swapped at inference time (see the sketch below)
|
|
4. **Reduced Overfitting**: Lower parameter count helps prevent overfitting on small datasets |
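
The storage and composability points can be illustrated with a short sketch. The filenames and the assumption that adapter parameters include `"adapter"` in their names are illustrative, not part of this repository:

```python
import torch
from modeling_olmoe import OlmoEWithAdaptersForCausalLM

model = OlmoEWithAdaptersForCausalLM.from_pretrained("./olmoe-adapter-finetuned")

# Save only the adapter weights, which are tiny compared with the full model.
adapter_state = {k: v for k, v in model.state_dict().items() if "adapter" in k}
torch.save(adapter_state, "wikitext_adapters.pt")

# Later, with the same frozen base model loaded, swap in adapters trained for another task.
model.load_state_dict(torch.load("summarization_adapters.pt"), strict=False)
```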
|
|
|
|
|
## How to Use the Fine-Tuned Model |
|
|
|
|
|
```python
from transformers import AutoTokenizer
from modeling_olmoe import OlmoEWithAdaptersForCausalLM

# Load the fine-tuned model and tokenizer
model = OlmoEWithAdaptersForCausalLM.from_pretrained("./olmoe-adapter-finetuned")
tokenizer = AutoTokenizer.from_pretrained("./olmoe-adapter-finetuned")

# Generate text
inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
|
|
|
|
|
## Adapter Size Recommendations |
|
|
|
|
|
The adapter size determines the parameter efficiency vs. performance trade-off: |
|
|
|
|
|
- **Small datasets**: 16-32 dimensions |
|
|
- **Medium datasets**: 64-128 dimensions |
|
|
- **Large datasets**: 128-256 dimensions |
|
|
|
|
|
For most fine-tuning scenarios, an adapter size of 64 provides a good balance between efficiency and performance. |
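
To make the trade-off concrete, the back-of-the-envelope calculation below estimates how many parameters the adapters add. The hidden size, layer count, and two-projection adapter layout are illustrative assumptions, not measured values for this model:

```python
def adapter_params(hidden_size: int, adapter_size: int, num_layers: int) -> int:
    """Rough parameter count for bottleneck adapters: one down- and one up-projection
    (weights plus biases) per layer."""
    per_layer = hidden_size * adapter_size + adapter_size   # down-projection
    per_layer += adapter_size * hidden_size + hidden_size   # up-projection
    return per_layer * num_layers


# Example: hidden_size=2048, 16 layers, adapter_size=64 -> ~4.2M adapter parameters,
# a small fraction of a multi-billion-parameter base model.
print(adapter_params(2048, 64, 16))
```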