---
base_model: colqwen2.5-base
library_name: peft
---
# RegionRet
RegionRet is a LoRA adapter for region-level vision-language retrieval, fine-tuned from ColQwen2.5-Base with Parameter-Efficient Fine-Tuning (PEFT).
## Model Details
- **Model Type:** LoRA Adapter (PEFT)
- **Base Model:** ColQwen2.5-Base
- **Task Type:** Feature Extraction
- **Framework:** PEFT 0.14.0
### LoRA Configuration
- **Rank (r):** 32
- **LoRA Alpha:** 32
- **LoRA Dropout:** 0.1
- **Target Modules:** MLP projections (down_proj, gate_proj, up_proj) and attention projections (k_proj, q_proj, v_proj, o_proj), plus custom_text_proj
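
As a rough illustration of what these values imply, the sketch below computes the scaling factor that standard LoRA applies to its low-rank update (`lora_alpha / r`, assuming the default non-rsLoRA scaling) and the number of trainable parameters that one adapted linear layer gains. The helper names and the 2048-wide layer are assumptions chosen for illustration; they are not taken from this model's configuration.

```python
def lora_scaling(r: int, lora_alpha: int) -> float:
    """Scaling applied to the low-rank update B @ A in standard LoRA."""
    return lora_alpha / r


def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer.

    A has shape (r, d_in) and B has shape (d_out, r).
    """
    return r * (d_in + d_out)


# Values from the LoRA configuration above.
R, ALPHA = 32, 32
scaling = lora_scaling(R, ALPHA)

# ASSUMPTION: a hypothetical square projection of width 2048.
D_IN = D_OUT = 2048
added = lora_param_count(D_IN, D_OUT, R)
full = D_IN * D_OUT

print(f"scaling = {scaling}")
print(f"LoRA adds {added} parameters vs {full} in the frozen weight")
```

With rank and alpha both set to 32, the update is applied at unit scale, so the adapter's effective strength is governed by the learned weights alone.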
### Model Architecture
- **Processor:** ColQwen2_5_Processor
- **Max Visual Tokens:** 1536
- **Attention:** Flash Attention 2
- **Precision:** bfloat16
## Uses
Please refer to [https://github.com/Aeryn666/RegionRAG](https://github.com/Aeryn666/RegionRAG).
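
For orientation only, here is a minimal sketch of how a PEFT adapter of this kind is typically attached to its base model. It assumes the stock `ColQwen2_5` and `ColQwen2_5_Processor` classes from `colpali_engine`, the Hub id `vidore/colqwen2.5-base` for the base model, and a CUDA device with Flash Attention 2 installed. None of this is confirmed by this card, and the RegionRAG repository may load the model differently, so treat the repository as authoritative.

```python
def load_regionret(adapter_path: str, base_model_id: str = "vidore/colqwen2.5-base"):
    """Hypothetical helper: load the base model, then attach the LoRA adapter.

    ASSUMPTIONS: the base model id, the use of colpali_engine's stock classes,
    and the device placement are illustrative, not taken from this model card.
    """
    # Heavy dependencies are imported lazily so the module itself stays light.
    import torch
    from colpali_engine.models import ColQwen2_5, ColQwen2_5_Processor
    from peft import PeftModel

    base = ColQwen2_5.from_pretrained(
        base_model_id,
        torch_dtype=torch.bfloat16,               # precision listed above
        attn_implementation="flash_attention_2",  # attention backend listed above
        device_map="cuda:0",
    )
    # Wrap the frozen base model with the adapter weights.
    model = PeftModel.from_pretrained(base, adapter_path).eval()
    processor = ColQwen2_5_Processor.from_pretrained(base_model_id)
    return model, processor
```

Here `adapter_path` would be a local directory or Hub repository containing this adapter's `adapter_config.json` and weights.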
## Training Details
### Training Data
- VisRAG-Ret-Train-In-domain-data
- Visual-CoT (DocVQA, TextCap, TextVQA, InfographicsVQA)
### Training Configuration
- **Loss Function:** RegionContraLoss (global_tau=0.02, local_tau=0.25, local_coef=0.01)
- **Epochs:** 5
- **Batch Size:** 80 per device
- **Learning Rate:** 2e-4
- **Precision:** bfloat16
- **Gradient Checkpointing:** Enabled
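
To give a feel for the two temperatures, the sketch below evaluates a generic temperature-scaled contrastive (InfoNCE) loss on made-up similarity scores. It is not the RegionContraLoss implementation: the scores, the `info_nce` helper, and the weighted-sum combination of a global and a local term are all assumptions for illustration.

```python
import math


def info_nce(scores, positive_index, tau):
    """Cross-entropy over similarity scores divided by temperature tau."""
    logits = [s / tau for s in scores]
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[positive_index]


# Hyperparameters from the training configuration above.
GLOBAL_TAU, LOCAL_TAU, LOCAL_COEF = 0.02, 0.25, 0.01

# Made-up similarities; index 0 is the matching (positive) candidate.
scores = [0.8, 0.6, 0.5]

global_term = info_nce(scores, 0, GLOBAL_TAU)
local_term = info_nce(scores, 0, LOCAL_TAU)

# ASSUMPTION: the two terms are combined as a weighted sum.
total = global_term + LOCAL_COEF * local_term

print(f"global term (tau={GLOBAL_TAU}): {global_term:.6f}")
print(f"local term  (tau={LOCAL_TAU}): {local_term:.6f}")
```

On the same scores, the smaller temperature sharpens the softmax, so a correctly ranked positive yields a much smaller loss than it does at the larger temperature.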
## Limitations
- Requires the ColQwen2.5-Base model; the adapter cannot be used on its own
- Optimized for region-level vision-language retrieval tasks
- GPU with bfloat16 and Flash Attention 2 support recommended
## Citation
If you use this model, please cite:
```bibtex
@misc{li2025regionragregionlevelretrievalaugmentedgeneration,
title={RegionRAG: Region-level Retrieval-Augmented Generation for Visual Document Understanding},
author={Yinglu Li and Zhiying Lu and Zhihang Liu and Yiwei Sun and Chuanbin Liu and Hongtao Xie},
year={2025},
eprint={2510.27261},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2510.27261},
}
```
## License
Please refer to the license of the base model, ColQwen2.5-Base.