raxder-ai committed · Commit 04d684a · verified · 1 Parent(s): 4465e83

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +44 -50
README.md CHANGED
@@ -20,17 +20,17 @@ tags:
  inference: true
  ---
 
- # Rax 4.5 - Efficient 2B Vision Language Model | Multimodal AI
 
- **Rax 4.5** is a state-of-the-art 2 billion parameter multimodal vision-language model optimized for production use. Process images and text together with up to 262K token context length.
 
- ## 🚀 Why Rax 4.5?
 
- - **⚡ Fast & Efficient**: Only 2B parameters for quick inference
- - **🖼️ Vision + Text**: True multimodal understanding of images and language
- - **📏 Long Context**: 262,144 token context window for complex tasks
- - **🔧 Production Ready**: Works with vLLM, SGLang, Transformers out of the box
- - **💾 Memory Efficient**: Hybrid attention architecture reduces VRAM usage
 
  ## Model Specifications
 
@@ -45,13 +45,13 @@ inference: true
  | **Precision** | BFloat16 |
  | **License** | Apache 2.0 |
 
- ## 🔥 Key Capabilities
 
- **Image Understanding** - Analyze, describe, and answer questions about images
- **Visual Question Answering** - Extract information from screenshots, documents, charts
- **Multimodal Reasoning** - Combine visual and textual information for complex tasks
- **Long Context Processing** - Handle extensive documents with visual elements
- **Production Deployment** - Optimized for real-world applications
 
  ## Quick Start
 
@@ -99,10 +99,9 @@ outputs = model.generate(**inputs, max_new_tokens=512)
  print(processor.decode(outputs[0], skip_special_tokens=True))
  \`\`\`
 
- ### Deploy with vLLM (High Performance)
 
  \`\`\`bash
- # Start vLLM server
  vllm serve raxcore-dev/rax-3.5-chat --port 8000 --max-model-len 8192
  \`\`\`
 
@@ -124,49 +123,49 @@ response = client.chat.completions.create(
  print(response.choices[0].message.content)
  \`\`\`
 
- ## 🏗️ Architecture Details
 
- - **Hybrid Attention Mechanism**: Alternates between linear and full attention for efficiency
- - **Vision Transformer**: 24-layer encoder with 16x16 patch size, 2x2 spatial merging
- - **Optimized KV Cache**: 2 key-value heads for 75% memory reduction
- - **Multi-Resolution Position Embeddings**: Handles various image sizes and long sequences
- - **Cross-Modal Fusion**: Advanced alignment between vision and language representations
 
- ## 📊 Use Cases
 
- - **Document Analysis**: Extract data from invoices, receipts, forms
- - **Visual QA Systems**: Build AI that answers questions about images
- - **Content Moderation**: Analyze images with contextual understanding
- - **Educational Tools**: Explain diagrams, charts, and scientific images
- - **Accessibility**: Generate detailed image descriptions for visually impaired users
- - **E-commerce**: Product analysis and description generation
- - **Medical Imaging**: Assist with image interpretation (not diagnostic)
 
- ## ⚙️ Performance Tips
 
- - **Temperature**: Use 0.6-0.8 for factual tasks, 0.8-1.0 for creative content
- - **Context Window**: For >32K tokens, ensure 24GB+ VRAM
- - **Batch Processing**: Process multiple images/texts together for efficiency
- - **Quantization**: Use 4-bit/8-bit quantization for lower memory footprint
- - **GPU Requirements**: Minimum 12GB VRAM (16GB recommended)
 
- ## 🚨 Limitations
 
  - 2B parameters may struggle with highly complex reasoning vs larger models
  - Vision encoder optimized for natural images (not specialized medical/satellite imagery)
  - Long context (>100K tokens) requires significant GPU memory
  - Not fine-tuned for specific domains without additional training
 
- ## 🤝 Model Comparison
 
  | Model | Params | Context | Multimodal | Speed |
  |-------|--------|---------|------------|-------|
- | **Rax 4.5** | 2B | 262K | | ⚡⚡⚡ |
- | LLaVA 1.5 | 7B | 4K | | ⚡⚡ |
- | GPT-4V | - | 128K | | |
- | Qwen-VL | 7B | 32K | | ⚡⚡ |
 
- ## 📖 Citation
 
  \`\`\`bibtex
  @misc{rax4.5,
@@ -177,15 +176,10 @@ print(response.choices[0].message.content)
  }
  \`\`\`
 
- ## 📄 License
 
  Apache 2.0 - Free for commercial and research use
 
- ## 🔗 Links
-
- - [Model Card](https://huggingface.co/raxcore-dev/rax-3.5-chat)
- - [Raxcore GitHub](https://github.com/raxcore-dev)
-
  ---
 
- **Keywords**: vision language model, multimodal AI, image to text, VLM, computer vision, transformers, efficient LLM, 2B parameters, long context, production AI, visual question answering, image understanding, open source AI model
 
  inference: true
  ---
 
+ # Rax 4.5 - Efficient 2B Vision Language Model
 
+ Rax 4.5 is a state-of-the-art 2 billion parameter multimodal vision-language model optimized for production use. Process images and text together with up to 262K token context length.
 
+ ## Key Features
 
+ - Fast & Efficient: Only 2B parameters for quick inference
+ - Vision + Text: True multimodal understanding of images and language
+ - Long Context: 262,144 token context window for complex tasks
+ - Production Ready: Works with vLLM, SGLang, Transformers out of the box
+ - Memory Efficient: Hybrid attention architecture reduces VRAM usage
 
  ## Model Specifications
 
 
  | **Precision** | BFloat16 |
  | **License** | Apache 2.0 |
 
+ ## Capabilities
 
+ - Image Understanding: Analyze, describe, and answer questions about images
+ - Visual Question Answering: Extract information from screenshots, documents, charts
+ - Multimodal Reasoning: Combine visual and textual information for complex tasks
+ - Long Context Processing: Handle extensive documents with visual elements
+ - Production Deployment: Optimized for real-world applications
 
  ## Quick Start
 
 
  print(processor.decode(outputs[0], skip_special_tokens=True))
  \`\`\`
 
+ ### Deploy with vLLM
 
  \`\`\`bash
  vllm serve raxcore-dev/rax-3.5-chat --port 8000 --max-model-len 8192
  \`\`\`
 
 
  print(response.choices[0].message.content)
  \`\`\`
 
+ ## Architecture Details
 
+ - Hybrid Attention Mechanism: Alternates between linear and full attention for efficiency
+ - Vision Transformer: 24-layer encoder with 16x16 patch size, 2x2 spatial merging
+ - Optimized KV Cache: 2 key-value heads for 75% memory reduction
+ - Multi-Resolution Position Embeddings: Handles various image sizes and long sequences
+ - Cross-Modal Fusion: Advanced alignment between vision and language representations
 
+ ## Use Cases
 
+ - Document Analysis: Extract data from invoices, receipts, forms
+ - Visual QA Systems: Build AI that answers questions about images
+ - Content Moderation: Analyze images with contextual understanding
+ - Educational Tools: Explain diagrams, charts, and scientific images
+ - Accessibility: Generate detailed image descriptions for visually impaired users
+ - E-commerce: Product analysis and description generation
+ - Medical Imaging: Assist with image interpretation (not diagnostic)
 
+ ## Performance Tips
 
+ - Temperature: Use 0.6-0.8 for factual tasks, 0.8-1.0 for creative content
+ - Context Window: For >32K tokens, ensure 24GB+ VRAM
+ - Batch Processing: Process multiple images/texts together for efficiency
+ - Quantization: Use 4-bit/8-bit quantization for lower memory footprint
+ - GPU Requirements: Minimum 12GB VRAM (16GB recommended)
 
+ ## Limitations
 
  - 2B parameters may struggle with highly complex reasoning vs larger models
  - Vision encoder optimized for natural images (not specialized medical/satellite imagery)
  - Long context (>100K tokens) requires significant GPU memory
  - Not fine-tuned for specific domains without additional training
 
+ ## Model Comparison
 
  | Model | Params | Context | Multimodal | Speed |
  |-------|--------|---------|------------|-------|
+ | Rax 4.5 | 2B | 262K | Yes | Fast |
+ | LLaVA 1.5 | 7B | 4K | Yes | Medium |
+ | GPT-4V | - | 128K | Yes | Slow |
+ | Qwen-VL | 7B | 32K | Yes | Medium |
 
+ ## Citation
 
  \`\`\`bibtex
  @misc{rax4.5,
 
  }
  \`\`\`
 
+ ## License
 
  Apache 2.0 - Free for commercial and research use
 
  ---
 
+ Keywords: vision language model, multimodal AI, image to text, VLM, computer vision, transformers, efficient LLM, 2B parameters, long context, production AI, visual question answering, image understanding, open source AI model
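
Review note on the "Optimized KV Cache" bullet: the README states that 2 key-value heads give a 75% memory reduction but does not name the baseline. A rough sketch of the arithmetic follows, assuming the comparison is against 8 key-value heads. That baseline, the layer count, head dimension, and sequence length are all hypothetical values chosen for illustration; only the 2 KV heads and the 2-byte BFloat16 element size come from the README. The sketch also treats every layer as full attention, which overstates the cache for a hybrid linear/full design.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Approximate KV-cache size for one sequence, in bytes.

    bytes_per_elem defaults to 2, matching BFloat16.
    """
    # The leading factor 2 counts both the key tensor and the value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical shapes for illustration; none of these come from the README.
NUM_LAYERS = 32
HEAD_DIM = 128
SEQ_LEN = 8192          # same value as --max-model-len in the vLLM command
BASELINE_KV_HEADS = 8   # assumed baseline; the README does not state one
RAX_KV_HEADS = 2        # from the "Optimized KV Cache" bullet

baseline = kv_cache_bytes(NUM_LAYERS, BASELINE_KV_HEADS, HEAD_DIM, SEQ_LEN)
reduced = kv_cache_bytes(NUM_LAYERS, RAX_KV_HEADS, HEAD_DIM, SEQ_LEN)
reduction = 1 - reduced / baseline

print(f"{baseline // 2**20} MiB -> {reduced // 2**20} MiB ({reduction:.0%} smaller)")
# prints "1024 MiB -> 256 MiB (75% smaller)"
```

Because cache size is linear in the number of KV heads, the 75% figure holds for any layer count or sequence length as long as the baseline has four times as many KV heads. Stating that baseline in the README would make the claim checkable.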