# Aurea: Adaptive Multimodal Fusion for Vision-Language Models

Aurea is an open-source research project that advances vision-language model (VLM) pretraining by combining two state-of-the-art vision encoders, DINOv2 and SigLIP2. At its core is an adaptive **spatial-range attention** mechanism that fuses spatial and semantic information from encoder-derived visual features, yielding richer, more context-aware representations for downstream tasks.

[Explore the full source code and technical documentation on GitHub](https://github.com/Dcas89/Aurea)

## Key Features

- **Multiple Vision Encoders:** Input images are encoded separately by DINOv2 and SigLIP2.

- **Multi-stage Fusion:** The `SpatialRangeBlock` fuses these inputs through multiple layers of `SpatialRangeAttention`, which selectively aggregates features by jointly considering spatial proximity and semantic similarity. This is performed with a highly optimized fused CUDA kernel.

- **Flexible Language Model Integration:** While Phi-4 is the default language model, Aurea is designed for easy adaptation to other pretrained language models with minimal engineering effort.

- **Model Weights:** Two model checkpoints are provided: (1) base pretrained weights (trained on a ~558k-image subset of LAION) and (2) instruction-tuned weights (further fine-tuned on ~625k samples from the LLaVA 1.5 datasets). All checkpoints can be downloaded directly from this repository.

- **Extensible and Modular:** The code supports straightforward extension, experimentation, and integration with novel encoders or downstream tasks.
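The spatial-range idea can be illustrated with a dense NumPy sketch. This is not the repo's fused CUDA kernel; the function name, signature, and Gaussian kernels below are hypothetical, shown only to convey the bilateral-filter-style weighting of spatial proximity against feature similarity:

```python
import numpy as np

def spatial_range_attention(feats, coords, sigma_s=2.0, sigma_r=1.0):
    """Toy dense sketch of spatial-range aggregation.

    Each token attends to all others with a weight that combines
    spatial proximity (distance between patch coordinates) and
    range/semantic similarity (distance between feature vectors).
    """
    # Pairwise squared spatial distances between patch centers
    d_s = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    # Pairwise squared feature (range) distances
    d_r = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(-1)
    # Joint kernel: close in space AND similar in content -> high weight
    logits = -d_s / (2 * sigma_s ** 2) - d_r / (2 * sigma_r ** 2)
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    # Each output row is a convex combination of the input features
    return w @ feats
```

The actual kernel fuses these distance computations and the weighted aggregation into a single CUDA pass rather than materializing the full pairwise matrices.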

## Installation

1. **Clone the repository**

```bash
git clone https://github.com/Dcas89/Aurea.git
cd Aurea
```

2. **Install Python dependencies**

```bash
pip install -r requirements.txt
```

## Usage

First, initialize the Aurea model:

```python
from entry import Aurea

aurea = Aurea(root_dir='/path/to/Aurea')
```

> **Note:** When the model is initialized, all required model checkpoints are downloaded automatically.

### Image + Text Generation (Basic)

Generate text from an image and a prompt:

```python
# Basic image + text generation
response = aurea.generate(
    prompt="How many remote control devices are in this image?",
    image_path='./assets/cats.png'  # Example image included in the repo
)
print(response)
```

### Generation with Custom Parameters

Tune generation parameters for more control:

```python
# Advanced generation with custom parameters
response = aurea.generate(
    prompt="Only one cat is wearing a collar in the image. Which cat is it? Answer Briefly: Left, Right, or Both",
    image_path='./assets/cats.png',               # Example image included in the repo
    max_new_tokens=50,                            # Maximum number of tokens to generate
    temperature=0.1,                              # Lower values make output more deterministic
    repetition_penalty=1.1,                       # Penalizes token repetition (>1.0)
    filter_kwargs={'thres': 0.90, 'top_k': 50},   # Parameters for the filtering function
    use_dynamic_top_k=False,                      # Whether to use dynamic top-k sampling
    min_top_k=50,                                 # Minimum top-k value if using dynamic top-k
    max_top_k=90,                                 # Maximum top-k value if using dynamic top-k
    filter_fn=None,                               # Custom filtering function
    exclude_prompt=True                           # Whether to exclude the prompt from the returned text
)
print(response)
```
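For intuition, `temperature` and `repetition_penalty` follow standard decoding practice. The sketch below is illustrative only (the function name and details are assumptions based on the common CTRL-style penalty, not Aurea's exact code):

```python
import numpy as np

def apply_sampling_controls(logits, generated_ids, temperature=0.1,
                            repetition_penalty=1.1):
    # Illustrative only: standard temperature scaling plus a
    # CTRL-style repetition penalty; Aurea's code may differ.
    logits = logits.astype(float).copy()
    for t in set(generated_ids):
        # Dampen already-generated tokens: divide positive logits,
        # multiply negative ones, so the penalty always lowers them
        logits[t] = logits[t] / repetition_penalty if logits[t] > 0 \
            else logits[t] * repetition_penalty
    # Temperature < 1 sharpens the distribution toward the argmax
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    return probs / probs.sum()
```

With `temperature=0.1`, the resulting distribution is strongly peaked, so generation is close to greedy while the penalty still discourages loops.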

### Logit Filtering

Use a specific filtering function (e.g., `top_p`):

```python
from generate import top_p

response = aurea.generate(
    prompt="Only one cat is wearing a collar in the image. What is the color of the collar? Answer Briefly: Blue, Light Green, Yellow",
    image_path='./assets/cats.png',  # Example image included in the repo
    max_new_tokens=50,
    temperature=0.1,
    repetition_penalty=1.1,
    filter_kwargs={'thres': 0.99, 'top_k': 50},
    filter_fn=top_p,  # Use top-p (nucleus) sampling
    exclude_prompt=True
)
print(response)
```
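Top-p (nucleus) filtering keeps only the smallest set of tokens whose cumulative probability reaches the threshold. A minimal sketch of the idea (illustrative only; the actual `top_p` in `generate.py` may use a different signature and cutoff convention):

```python
import numpy as np

def nucleus_filter(logits, thres=0.9):
    # Sort tokens from most to least likely
    order = np.argsort(logits)[::-1]
    probs = np.exp(logits[order] - logits[order].max())
    probs /= probs.sum()
    # Keep tokens while cumulative probability stays within the threshold
    keep = np.cumsum(probs) <= thres
    keep[0] = True  # always keep at least the most likely token
    # Mask everything else out so it can never be sampled
    filtered = np.full_like(logits, -np.inf, dtype=float)
    filtered[order[keep]] = logits[order[keep]]
    return filtered
```

A higher `thres` (e.g., 0.99 as in the example above) admits more of the tail, trading determinism for diversity.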

### Dynamic Top-K Sampling

Use dynamic top-k sampling, which interpolates from `max_top_k` down to `min_top_k` over the course of generation:

```python
text = aurea.generate(
    prompt="What does the logo say and what does it represent?",
    image_path='./assets/mazure.png',
    max_new_tokens=100,
    temperature=0.1,
    repetition_penalty=1.1,
    filter_kwargs={'thres': 0.99, 'top_k': 50},
    use_dynamic_top_k=True,  # Enable dynamic top-k sampling
    min_top_k=50,            # Lower bound for top-k
    max_top_k=90,            # Upper bound for top-k
    filter_fn=None,
    exclude_prompt=True
)

print(text)
```
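The interpolation described above can be sketched as a simple linear schedule (a hypothetical sketch; the repo's exact interpolation may differ):

```python
def dynamic_top_k(step, max_new_tokens, min_top_k=50, max_top_k=90):
    """Linearly anneal top-k from max_top_k at the first generated
    token down to min_top_k at the last one (illustrative sketch)."""
    if max_new_tokens <= 1:
        return min_top_k
    frac = step / (max_new_tokens - 1)  # 0.0 at first step, 1.0 at last
    return round(max_top_k + frac * (min_top_k - max_top_k))
```

Early steps thus sample from a wider candidate pool, while later steps narrow toward more deterministic choices.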

### Text-Only Generation

Aurea can also be used for text-only tasks:

```python
# Text-only generation (no image)
response = aurea.generate(
    prompt="What is CUDA programming?",
    max_new_tokens=200,
    temperature=0.1,
    repetition_penalty=1.1,
    filter_kwargs={'thres': 0.9, 'top_k': 50},
    exclude_prompt=True
)
print(response)
```

## References

- [SigLIP 2: Multilingual Vision-Language Encoders](https://doi.org/10.48550/arXiv.2502.14786)
- [Phi-4 Technical Report](https://doi.org/10.48550/arXiv.2412.08905)
- [DINOv2: Learning Robust Visual Features without Supervision](https://doi.org/10.48550/arXiv.2304.07193)
- [LLaVA](https://github.com/haotian-liu/LLaVA)
- [LLaVA-UHD](https://github.com/thunlp/LLaVA-UHD)

## License

This project is released under the Apache 2.0 License.

## Acknowledgements

- The CUDA spatial-range attention kernel is inspired by and adapted from LLaVA-UHD.
- Some components were adapted from [lucidrains](https://github.com/lucidrains) repositories, which provide excellent implementations of various transformer and attention mechanisms.
- Thanks to the open-source community for DINOv2, SigLIP2, LLaVA, LLaVA-UHD, and Phi-4.
- Thanks to Hugging Face for their [Transformers](https://github.com/huggingface/transformers) and [Accelerate](https://github.com/huggingface/accelerate) libraries.

This project incorporates code and models from:

- Phi-4 Mini: Copyright (c) 2025 Microsoft Corporation
- DINOv2: Copyright (c) 2024 Meta Platforms, Inc.
- SigLIP2: Copyright (c) 2025 Google LLC