---
license: apache-2.0
language:
- en
base_model:
- microsoft/Phi-4-mini-instruct
- facebook/dinov2-with-registers-giant
- google/siglip2-so400m-patch14-224
base_model_relation: adapter
pipeline_tag: image-text-to-text
---

# Aurea: Adaptive Multimodal Fusion for Vision-Language Models

<div align="center">
  <img src="https://raw.githubusercontent.com/Dcas89/Aurea/refs/heads/main/assets/aurea_logo.png" alt="Aurea Logo" width="200" height="200">
</div>

Aurea is an open-source research framework centered on an adaptive spatial-range attention module that fuses spatial and semantic cues from encoder features, yielding richer, context-aware representations for downstream tasks.

[Explore the full source code and technical documentation on GitHub](https://github.com/Dcas89/Aurea)

## Key Features

- **Multiple Vision Encoders:** Input images are encoded separately by DINOv2 and SigLIP2.

- **Multi-stage Fusion:** The `SpatialRangeBlock` fuses these inputs through multiple layers of `SpatialRangeAttention`, which selectively aggregates features by jointly considering spatial proximity and semantic similarity (see the sketch after this list). The operation is implemented with a highly optimized fused CUDA kernel.

- **Flexible Language Model Integration:** While Phi-4 is the default language model, Aurea is designed for easy adaptation to other pretrained language models with minimal engineering effort.

- **Model Weights:** Two model checkpoints are provided: (1) base pretrained weights (trained on a ~558k image subset of LAION) and (2) instruction-tuned weights (further fine-tuned on ~625k samples from LLaVA 1.5 datasets). All checkpoints can be downloaded directly from this repository.

- **Extensible and Modular:** The code supports straightforward extension, experimentation, and integration with novel encoders or downstream tasks.
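
To make the fusion idea concrete, here is a minimal PyTorch sketch of spatial-range attention. It is an illustrative assumption, not the repository's implementation: the actual `SpatialRangeBlock` runs a fused CUDA kernel, and the function name, tensor shapes, and Gaussian weighting below are hypothetical.

```python
# Illustrative sketch: aggregate each patch's features by jointly weighting
# neighbors on spatial proximity and feature (semantic) similarity.
# The real SpatialRangeBlock uses a fused CUDA kernel; shapes and the
# Gaussian kernels here are assumptions made for clarity.
import torch
import torch.nn.functional as F

def spatial_range_attention(feats, coords, sigma_spatial=2.0, sigma_range=1.0):
    """
    feats:  (N, C) patch features (e.g., encoder tokens to be fused)
    coords: (N, 2) patch grid coordinates
    Returns aggregated features of shape (N, C).
    """
    # Spatial term: nearby patches receive higher weight (Gaussian in distance)
    d_spatial = torch.cdist(coords, coords)                  # (N, N)
    w_spatial = -(d_spatial ** 2) / (2 * sigma_spatial ** 2)

    # Range term: semantically similar patches receive higher weight
    d_range = torch.cdist(feats, feats)                      # (N, N)
    w_range = -(d_range ** 2) / (2 * sigma_range ** 2)

    # Joint attention over all patches, then weighted aggregation
    attn = F.softmax(w_spatial + w_range, dim=-1)            # (N, N)
    return attn @ feats                                      # (N, C)
```

In the repository this kind of aggregation is stacked over multiple `SpatialRangeAttention` layers inside the `SpatialRangeBlock`; the sketch only illustrates the joint spatial/semantic weighting.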

## Installation

1. **Clone the source repository**

```bash
git clone https://github.com/Dcas89/Aurea.git
cd Aurea
```

2. **Install Python dependencies**

```bash
pip install -r requirements.txt
```

## Usage

First, initialize the Aurea model:

```python
from entry import Aurea

aurea = Aurea(root_dir='/path/to/Aurea')
```

> **Note:** All required model checkpoints are downloaded automatically when the model is initialized.

### Image + Text Generation (Basic)

Generate text based on an image and prompt:

```python
# Basic image + text generation
response = aurea.generate(
    prompt="How many remote control devices are in this image?", 
    image_path='./assets/cats.png'  # Example image included in the repo
)
print(response)
```

### Generation with Custom Parameters

Tune generation parameters for more control:

```python
# Advanced generation with custom parameters
response = aurea.generate(
    prompt="Only one cat is wearing a collar in the image. Which cat is it? Answer Briefly: Left, Right, or Both", 
    image_path='./assets/cats.png',  # Example image included in the repo
    max_new_tokens=50,          # Maximum number of tokens to generate
    temperature=0.1,            # Lower values make output more deterministic
    repetition_penalty=1.1,     # Penalizes token repetition (>1.0)
    filter_kwargs={'thres': 0.90, 'top_k': 50},  # Parameters for filtering function
    use_dynamic_top_k=False,    # Whether to use dynamic top-k sampling
    min_top_k=50,               # Minimum top-k value if using dynamic top-k
    max_top_k=90,               # Maximum top-k value if using dynamic top-k
    filter_fn=None,             # Custom filtering function
    exclude_prompt=True         # Whether to exclude prompt from returned text
)
print(response)
```

### Logit Filtering

Use a specific filtering function (e.g., `top_p`):

```python
from generate import top_p

response = aurea.generate(
    prompt="Only one cat is wearing a collar in the image. What is the color of the collar? Answer Briefly: Blue, Light Green, Yellow", 
    image_path='./assets/cats.png',  # Example image included in the repo
    max_new_tokens=50,
    temperature=0.1,
    repetition_penalty=1.1,
    filter_kwargs={'thres': 0.99, 'top_k': 50},
    filter_fn=top_p,            # Using top-p sampling
    exclude_prompt=True
)
print(response)
```
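
For reference, nucleus (top-p) filtering keeps only the smallest set of highest-probability tokens whose cumulative probability reaches a threshold and masks out the rest. The sketch below is a hypothetical illustration of that idea; the actual `top_p` in the repository may use a different signature or threshold convention.

```python
# Hypothetical nucleus (top-p) filter for illustration; the actual
# generate.top_p implementation and its threshold convention may differ.
import torch
import torch.nn.functional as F

def nucleus_filter(logits, thres=0.99):
    """Keep the smallest set of tokens whose cumulative probability reaches `thres`."""
    sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
    cum_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)

    # Mark tokens past the threshold for removal, shifted right so the
    # most likely token is always kept.
    remove = cum_probs > thres
    remove[..., 1:] = remove[..., :-1].clone()
    remove[..., 0] = False

    sorted_logits = sorted_logits.masked_fill(remove, float('-inf'))
    # Scatter the masked logits back to their original vocabulary order
    return logits.scatter(-1, sorted_idx, sorted_logits)
```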

### Dynamic Top-K Sampling

Use dynamic top-k sampling, which interpolates the top-k value from `max_top_k` down to `min_top_k` over the course of generation:

```python
response = aurea.generate(
    prompt="What does the logo say and what does it represent?", 
    image_path='./assets/mazure.png',
    max_new_tokens=100,
    temperature=0.1,
    repetition_penalty=1.1,
    filter_kwargs={'thres': 0.99, 'top_k': 50},
    use_dynamic_top_k=True,     # Enable dynamic top-k sampling
    min_top_k=50,               # Lower bound for top-k
    max_top_k=90,               # Upper bound for top-k
    filter_fn=None,
    exclude_prompt=True
)

print(response)
```
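
The schedule itself can be pictured as a simple linear interpolation: early steps sample from a wider candidate set (`max_top_k`), later steps from a narrower one (`min_top_k`). The helper below is a hypothetical illustration of such a schedule, not Aurea's internal code.

```python
# Hypothetical dynamic top-k schedule: linearly interpolate the effective k
# from max_top_k down to min_top_k over the generation steps.
# Aurea's internal scheduling may differ in detail.
def dynamic_top_k(step, total_steps, min_top_k=50, max_top_k=90):
    """Top-k value to use at a given generation step (0-indexed)."""
    if total_steps <= 1:
        return max_top_k
    frac = step / (total_steps - 1)   # 0.0 at the first step, 1.0 at the last
    return round(max_top_k - frac * (max_top_k - min_top_k))

# Example: over 5 steps the schedule walks 90 -> 80 -> 70 -> 60 -> 50
print([dynamic_top_k(s, 5) for s in range(5)])
```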

### Text-Only Generation

Aurea can also be used for text-only tasks:

```python
# Text-only generation (no image)
response = aurea.generate(
    prompt="What is CUDA programming?",
    max_new_tokens=200,
    temperature=0.1,
    repetition_penalty=1.1,
    filter_kwargs={'thres': 0.9, 'top_k': 50},
    exclude_prompt=True
)
print(response)
```

## References

- [SigLIP 2: Multilingual Vision-Language Encoders](https://doi.org/10.48550/arXiv.2502.14786)
- [Phi-4 Technical Report](https://doi.org/10.48550/arXiv.2412.08905)
- [DINOv2: Learning Robust Visual Features without Supervision](https://doi.org/10.48550/arXiv.2304.07193)
- [LLaVA](https://github.com/haotian-liu/LLaVA)
- [LLaVA-UHD](https://github.com/thunlp/LLaVA-UHD)

## License

This project is released under the Apache 2.0 License.

## Acknowledgements

- The CUDA spatial-range attention is inspired by and adapted from LLaVA-UHD.
- Some components were adapted from [lucidrains](https://github.com/lucidrains) repositories, which provide excellent implementations of various transformer and attention mechanisms.
- Thanks to the open-source community for DINOv2, SigLIP2, LLaVA, LLaVA-UHD, and Phi-4.
- Thanks to Hugging Face for their [Transformers](https://github.com/huggingface/transformers) and [Accelerate](https://github.com/huggingface/accelerate) libraries.

This project incorporates code and models from:

- Phi-4 Mini: Copyright (c) 2025 Microsoft Corporation
- DINOv2: Copyright (c) 2024 Meta Platforms, Inc.
- SigLIP2: Copyright (c) 2025 Google LLC