---
license: apache-2.0
language:
- en
base_model:
- microsoft/Phi-4-mini-instruct
- facebook/dinov2-with-registers-giant
- google/siglip2-so400m-patch14-224
base_model_relation: adapter
pipeline_tag: image-text-to-text
---
# Aurea: Adaptive Multimodal Fusion for Vision-Language Models
<div align="center">
<img src="https://raw.githubusercontent.com/Dcas89/Aurea/refs/heads/main/assets/aurea_logo.png" alt="Aurea Logo" width="200" height="200">
</div>
Aurea is an open-source research framework centered on an adaptive spatial-range attention module that fuses spatial and semantic cues from encoder features, yielding richer, context-aware representations for downstream tasks.
[Explore the full source code and technical documentation on GitHub](https://github.com/Dcas89/Aurea)
## Key Features
- **Multiple Vision Encoders:** Input images are encoded separately by DINOv2 and SigLIP2.
- **Multi-stage Fusion:** The `SpatialRangeBlock` fuses these inputs through multiple layers of `SpatialRangeAttention`, which selectively aggregates features by jointly considering spatial proximity and semantic similarity. This is performed with a highly optimized fused CUDA kernel (a conceptual sketch of the weighting follows this list).
- **Flexible Language Model Integration:** While Phi-4 is the default language model, Aurea is designed for easy adaptation to other pretrained language models with minimal engineering effort.
- **Model Weights:** Two model checkpoints are provided: (1) base pretrained weights (trained on a ~558k image subset of LAION) and (2) instruction-tuned weights (further fine-tuned on ~625k samples from LLaVA 1.5 datasets). All checkpoints can be downloaded directly from this repository.
- **Extensible and Modular:** The code supports straightforward extension, experimentation, and integration with novel encoders or downstream tasks.
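The following is a minimal, purely illustrative PyTorch sketch of the spatial-range idea: attention weights that combine semantic similarity between patch features with a Gaussian kernel over their grid distance. It does not reproduce Aurea's fused CUDA kernel or the actual `SpatialRangeAttention` module; the function name, arguments, and the specific Gaussian weighting are assumptions made for illustration only.

```python
import torch
import torch.nn.functional as F

def spatial_range_attention_sketch(feats, coords, spatial_sigma=2.0):
    # feats:  (N, D) patch features from a vision encoder
    # coords: (N, 2) patch (row, col) positions on the feature grid
    sem = (feats @ feats.t()) / feats.shape[-1] ** 0.5         # semantic similarity
    dist2 = torch.cdist(coords.float(), coords.float()) ** 2   # squared grid distances
    spa = -dist2 / (2 * spatial_sigma ** 2)                    # Gaussian spatial kernel (log-space)
    attn = F.softmax(sem + spa, dim=-1)                        # high weight = nearby AND similar
    return attn @ feats                                        # (N, D) aggregated features

# Toy usage on a 14x14 patch grid
feats = torch.randn(196, 768)
ys, xs = torch.meshgrid(torch.arange(14), torch.arange(14), indexing='ij')
coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1)
fused = spatial_range_attention_sketch(feats, coords)
```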
## Installation
1. **Clone the source repository**
```bash
git clone https://github.com/Dcas89/Aurea.git
cd Aurea
```
2. **Install Python dependencies**
```bash
pip install -r requirements.txt
```
## Usage
First, initialize the Aurea model:
```python
from entry import Aurea
aurea = Aurea(root_dir='/path/to/Aurea')
```
> **Note:** All required model checkpoints are downloaded automatically the first time the model is initialized.
### Image + Text Generation (Basic)
Generate text based on an image and prompt:
```python
# Basic image + text generation
response = aurea.generate(
    prompt="How many remote control devices are in this image?",
    image_path='./assets/cats.png'  # Example image included in the repo
)
print(response)
```
### Generation with Custom Parameters
Tune generation parameters for more control:
```python
# Advanced generation with custom parameters
response = aurea.generate(
    prompt="Only one cat is wearing a collar in the image. Which cat is it? Answer Briefly: Left, Right, or Both",
    image_path='./assets/cats.png',  # Example image included in the repo
    max_new_tokens=50,  # Maximum number of tokens to generate
    temperature=0.1,  # Lower values make output more deterministic
    repetition_penalty=1.1,  # Penalizes token repetition (>1.0)
    filter_kwargs={'thres': 0.90, 'top_k': 50},  # Parameters for filtering function
    use_dynamic_top_k=False,  # Whether to use dynamic top-k sampling
    min_top_k=50,  # Minimum top-k value if using dynamic top-k
    max_top_k=90,  # Maximum top-k value if using dynamic top-k
    filter_fn=None,  # Custom filtering function
    exclude_prompt=True  # Whether to exclude prompt from returned text
)
print(response)
```
### Logit Filtering
Using a specific filtering function (e.g., `top_p`):
```python
from generate import top_p
response = aurea.generate(
    prompt="Only one cat is wearing a collar in the image. What is the color of the collar? Answer Briefly: Blue, Light Green, Yellow",
    image_path='./assets/cats.png',  # Example image included in the repo
    max_new_tokens=50,
    temperature=0.1,
    repetition_penalty=1.1,
    filter_kwargs={'thres': 0.99, 'top_k': 50},
    filter_fn=top_p,  # Using top-p sampling
    exclude_prompt=True
)
print(response)
```
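The exact signature that `filter_fn` expects is defined in the repository's `generate.py`. Purely as an illustration, and assuming the filter receives a logits tensor along with the keys in `filter_kwargs` and returns filtered logits, a plain top-k filter might look like this (the function name and signature are hypothetical):

```python
import torch

def top_k_only(logits, thres=0.9, top_k=50):
    # Keep only the `top_k` highest-scoring logits; mask the rest to -inf.
    # `thres` is accepted (and ignored) only so the same filter_kwargs can be reused.
    k = min(top_k, logits.shape[-1])
    vals, _ = torch.topk(logits, k=k, dim=-1)
    cutoff = vals[..., -1, None]
    return logits.masked_fill(logits < cutoff, float('-inf'))
```

Such a function would then be passed via `filter_fn=top_k_only` together with matching `filter_kwargs`.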
### Dynamic Top-K Sampling
Example using dynamic top-k sampling, which interpolates the top-k value from `max_top_k` down to `min_top_k` over the course of generation:
```python
response = aurea.generate(
    prompt="What does the logo say and what does it represent?",
    image_path='./assets/mazure.png',
    max_new_tokens=100,
    temperature=0.1,
    repetition_penalty=1.1,
    filter_kwargs={'thres': 0.99, 'top_k': 50},
    use_dynamic_top_k=True,  # Enable dynamic top-k sampling
    min_top_k=50,  # Lower bound for top-k
    max_top_k=90,  # Upper bound for top-k
    filter_fn=None,
    exclude_prompt=True
)
print(response)
```
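The interpolation schedule itself is implemented in the repository; the sketch below shows one plausible linear annealing from `max_top_k` at the first generated token to `min_top_k` at the last, and the helper name is hypothetical:

```python
def dynamic_top_k_schedule(step, max_new_tokens, min_top_k=50, max_top_k=90):
    # Linearly anneal the top-k cutoff as generation proceeds:
    # step 0 -> max_top_k, final step -> min_top_k.
    frac = step / max(max_new_tokens - 1, 1)
    return round(max_top_k + frac * (min_top_k - max_top_k))
```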
### Text-Only Generation
Aurea can also be used for text-only tasks:
```python
# Text-only generation (no image)
response = aurea.generate(
    prompt="What is CUDA programming?",
    max_new_tokens=200,
    temperature=0.1,
    repetition_penalty=1.1,
    filter_kwargs={'thres': 0.9, 'top_k': 50},
    exclude_prompt=True
)
print(response)
```
## References
- [SigLIP 2: Multilingual Vision-Language Encoders](https://doi.org/10.48550/arXiv.2502.14786)
- [Phi-4 Technical Report](https://doi.org/10.48550/arXiv.2412.08905)
- [DINOv2: Learning Robust Visual Features without Supervision](https://doi.org/10.48550/arXiv.2304.07193)
- [LLaVA](https://github.com/haotian-liu/LLaVA)
- [LLaVA-UHD](https://github.com/thunlp/LLaVA-UHD)
## License
This project is released under the Apache 2.0 License.
## Acknowledgements
- The CUDA spatial-range attention is inspired by and adapted from LLaVA-UHD.
- Some components were adapted from [lucidrains](https://github.com/lucidrains) repositories, which provide excellent implementations of various transformer and attention mechanisms.
- Thanks to the open-source community for DINOv2, SigLIP2, LLaVA, LLaVA-UHD, and Phi-4.
- Thanks to Hugging Face for their [Transformers](https://github.com/huggingface/transformers) and [Accelerate](https://github.com/huggingface/accelerate) libraries.
This project incorporates code and models from:
- Phi-4 Mini: Copyright (c) 2025 Microsoft Corporation
- DINOv2: Copyright (c) 2024 Meta Platforms, Inc.
- SigLIP2: Copyright (c) 2025 Google LLC