raxder-ai committed on
Commit 4465e83 · verified · 1 Parent(s): 7f957d0

Upload README.md with huggingface_hub

  1. README.md +139 -57
README.md CHANGED
@@ -5,105 +5,187 @@ pipeline_tag: image-text-to-text
5
  tags:
6
  - multimodal
7
  - vision-language
8
- - chat
9
  ---
10
 
11
- # Rax 3.5 Chat
12
 
13
- Rax 3.5 Chat is a compact 2B parameter multimodal model for vision-language understanding and conversational AI. It supports text and image inputs with extended context up to 262K tokens.
14
 
15
- ## Model Details
16
 
17
- - **Parameters**: ~2B
18
- - **Context Length**: 262,144 tokens
19
- - **Input Modalities**: Text + Images
20
- - **Attention**: Hybrid linear + full attention (24 layers)
21
- - **Vision Encoder**: 24-layer transformer with 1024 hidden size
22
- - **Text Hidden Size**: 2048
23
- - **Precision**: BFloat16
24
 
25
- ## Key Features
26
 
27
- - **Multimodal Understanding**: Processes text and images in unified reasoning
28
- - **Long Context**: Supports up to 262K tokens for extended conversations
29
- - **Efficient Architecture**: Hybrid attention mechanism for optimal performance
30
- - **Production Ready**: Compatible with vLLM, SGLang, and Transformers
 
 
 
 
 
 
31
 
32
- ## Usage
33
 
34
- ### With Transformers
 
 
 
 
35
 
36
- ```python
 
 
 
 
 
 
 
 
 
 
37
  from transformers import AutoModelForVision2Seq, AutoProcessor
38
  from PIL import Image
39
 
40
- model = AutoModelForVision2Seq.from_pretrained("raxcore/Rax-3.5-Chat", trust_remote_code=True)
41
- processor = AutoProcessor.from_pretrained("raxcore/Rax-3.5-Chat", trust_remote_code=True)
 
 
 
 
 
 
 
42
 
43
- # Text-only conversation
44
- messages = [{"role": "user", "content": "What is the capital of France?"}]
45
  text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
46
  inputs = processor(text=text, return_tensors="pt")
47
  outputs = model.generate(**inputs, max_new_tokens=512)
48
  print(processor.decode(outputs[0], skip_special_tokens=True))
49
 
50
- # With image
51
- image = Image.open("image.jpg")
52
- messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Describe this image."}]}]
 
 
 
 
 
 
53
  text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
54
  inputs = processor(text=text, images=image, return_tensors="pt")
55
  outputs = model.generate(**inputs, max_new_tokens=512)
56
  print(processor.decode(outputs[0], skip_special_tokens=True))
57
- ```
58
 
59
- ### With vLLM
60
 
61
- ```bash
62
- vllm serve raxcore/Rax-3.5-Chat --port 8000 --max-model-len 8192
63
- ```
 
64
 
65
- ```python
66
  from openai import OpenAI
 
67
  client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")
68
 
69
  response = client.chat.completions.create(
70
- model="raxcore/Rax-3.5-Chat",
71
- messages=[{"role": "user", "content": "Hello!"}],
 
 
 
72
  temperature=0.7,
73
- max_tokens=512
74
  )
 
75
  print(response.choices[0].message.content)
76
- ```
 
 
77
 
78
- ## Architecture Highlights
 
 
 
 
79
 
80
- - **Hybrid Attention**: Alternates between linear attention and full attention layers for efficiency
81
- - **Vision Encoder**: 24-layer transformer with patch size 16 and spatial merge 2x2
82
- - **Efficient KV Cache**: 2 key-value heads for reduced memory footprint
83
- - **Multi-resolution Position Embeddings**: Optimized for long-context understanding
84
 
85
- ## Best Practices
 
 
 
 
 
 
86
 
87
- - Use temperature 0.6–0.8 for factual tasks, 0.8–1.0 for creative tasks
88
- - For long context (>32K tokens), ensure sufficient GPU memory
89
- - Enable trust_remote_code when loading the model
90
 
91
- ## Limitations
 
 
 
 
92
 
93
- - 2B parameters may limit complex reasoning compared to larger models
94
- - Vision understanding optimized for natural images
95
- - Long context requires significant memory resources
96
 
97
- ## License
 
 
 
98
 
99
- Apache 2.0
100
 
101
- ## Citation
 
 
 
 
 
102
 
103
- ```bibtex
104
- @misc{rax3.5chat,
105
- title={Rax 3.5 Chat: Efficient Multimodal Assistant Model},
 
 
106
  author={Raxcore},
107
- year={2026}
 
108
  }
109
- ```
5
  tags:
6
  - multimodal
7
  - vision-language
8
+ - vision
9
+ - image-to-text
10
+ - llm
11
+ - vision-language-model
12
+ - computer-vision
13
+ - deep-learning
14
+ - pytorch
15
+ - transformers
16
+ - vlm
17
+ - 2b
18
+ - efficient
19
+ - production
20
+ inference: true
21
  ---
22
 
23
+ # Rax 4.5 - Efficient 2B Vision Language Model | Multimodal AI
24
 
25
+ **Rax 4.5** is a state-of-the-art 2 billion parameter multimodal vision-language model optimized for production use. Process images and text together with up to 262K token context length.
26
 
27
+ ## 🚀 Why Rax 4.5?
28
 
+ - **⚡ Fast & Efficient**: Only 2B parameters for quick inference
+ - **🖼️ Vision + Text**: True multimodal understanding of images and language
+ - **Long Context**: 262,144 token context window for complex tasks
+ - **🔧 Production Ready**: Works with vLLM, SGLang, Transformers out of the box
+ - **💾 Memory Efficient**: Hybrid attention architecture reduces VRAM usage
 
 
34
 
35
+ ## Model Specifications
36
 
+ | Feature | Details |
+ |---------|---------|
+ | **Parameters** | ~2 Billion |
+ | **Context Length** | 262,144 tokens |
+ | **Input Types** | Text + Images |
+ | **Architecture** | Hybrid Linear + Full Attention (24 layers) |
+ | **Vision Encoder** | 24-layer ViT, 1024 hidden size |
+ | **Text Hidden Size** | 2048 |
+ | **Precision** | BFloat16 |
+ | **License** | Apache 2.0 |
47
 
48
+ ## 🔥 Key Capabilities
49
 
+ ✅ **Image Understanding** - Analyze, describe, and answer questions about images
+ ✅ **Visual Question Answering** - Extract information from screenshots, documents, charts
+ ✅ **Multimodal Reasoning** - Combine visual and textual information for complex tasks
+ ✅ **Long Context Processing** - Handle extensive documents with visual elements
+ ✅ **Production Deployment** - Optimized for real-world applications
55
 
56
+ ## Quick Start
57
+
58
+ ### Installation
59
+
+ ```bash
+ pip install transformers pillow torch accelerate
+ ```
63
+
64
+ ### Basic Usage with Transformers
65
+
+ ```python
  from transformers import AutoModelForVision2Seq, AutoProcessor
  from PIL import Image

+ # Load model
+ model = AutoModelForVision2Seq.from_pretrained(
+     "raxcore-dev/rax-3.5-chat",
+     trust_remote_code=True
+ )
+ processor = AutoProcessor.from_pretrained(
+     "raxcore-dev/rax-3.5-chat",
+     trust_remote_code=True
+ )

+ # Text generation
+ messages = [{"role": "user", "content": "Explain quantum computing"}]
  text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
  inputs = processor(text=text, return_tensors="pt")
  outputs = model.generate(**inputs, max_new_tokens=512)
  print(processor.decode(outputs[0], skip_special_tokens=True))

+ # Image analysis
+ image = Image.open("photo.jpg")
+ messages = [{
+     "role": "user",
+     "content": [
+         {"type": "image"},
+         {"type": "text", "text": "What's in this image? Be detailed."}
+     ]
+ }]
  text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
  inputs = processor(text=text, images=image, return_tensors="pt")
  outputs = model.generate(**inputs, max_new_tokens=512)
  print(processor.decode(outputs[0], skip_special_tokens=True))
+ ```
101
 
102
+ ### Deploy with vLLM (High Performance)
103
 
+ ```bash
+ # Start vLLM server
+ vllm serve raxcore-dev/rax-3.5-chat --port 8000 --max-model-len 8192
+ ```
108
 
+ ```python
  from openai import OpenAI
+
  client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")

  response = client.chat.completions.create(
+     model="raxcore-dev/rax-3.5-chat",
+     messages=[
+         {"role": "system", "content": "You are a helpful AI assistant."},
+         {"role": "user", "content": "Write a Python function to sort a list."}
+     ],
      temperature=0.7,
+     max_tokens=1024
  )
+
  print(response.choices[0].message.content)
+ ```
126
+
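
The vLLM request in this README is text-only. Because the model also takes images, one plausible way to send an image through the same OpenAI-compatible endpoint is an `image_url` content part carrying a base64 data URL. The sketch below only builds that payload. It assumes the server accepts OpenAI-style image parts, and `build_image_message` is a hypothetical helper name, not part of any library.

```python
import base64


def build_image_message(image_bytes: bytes, prompt: str, mime: str = "image/jpeg") -> list[dict]:
    """Build a chat `messages` list with one image and one text prompt.

    Hypothetical helper: the image is embedded as a base64 data URL inside an
    OpenAI-style `image_url` content part (assumed to be accepted by the server).
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return [{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
            {"type": "text", "text": prompt},
        ],
    }]


# Placeholder bytes for illustration; in practice read them from a file such as photo.jpg.
messages = build_image_message(b"\xff\xd8\xff", "Describe this image.")
```

The returned list can be passed as `messages=` to `client.chat.completions.create(...)` in place of the text-only list.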
127
+ ## 🏗️ Architecture Details
128
 
129
+ - **Hybrid Attention Mechanism**: Alternates between linear and full attention for efficiency
130
+ - **Vision Transformer**: 24-layer encoder with 16x16 patch size, 2x2 spatial merging
131
+ - **Optimized KV Cache**: 2 key-value heads for 75% memory reduction
132
+ - **Multi-Resolution Position Embeddings**: Handles various image sizes and long sequences
133
+ - **Cross-Modal Fusion**: Advanced alignment between vision and language representations
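
The README does not say what baseline the "75% memory reduction" for the KV cache is measured against. The arithmetic below is a sketch showing that the figure is consistent with comparing 2 key-value heads against 8. The baseline head count, the head dimension, the number of layers that keep a growing KV cache, and the sequence length are all assumed for illustration and are not taken from the model config.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size for one sequence.

    The leading factor 2 counts one key tensor and one value tensor per layer;
    bytes_per_value=2 corresponds to BFloat16.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed, illustrative values (not from the model config).
SEQ_LEN = 32_768
CACHED_LAYERS = 6   # hypothetical count of full-attention layers
HEAD_DIM = 128      # hypothetical head dimension

baseline = kv_cache_bytes(SEQ_LEN, CACHED_LAYERS, num_kv_heads=8, head_dim=HEAD_DIM)
reduced = kv_cache_bytes(SEQ_LEN, CACHED_LAYERS, num_kv_heads=2, head_dim=HEAD_DIM)
reduction = 1 - reduced / baseline

print(f"{baseline / 2**20:.0f} MiB -> {reduced / 2**20:.0f} MiB ({reduction:.0%} smaller)")
# prints "768 MiB -> 192 MiB (75% smaller)"
```

The ratio depends only on the head counts, so the percentage stays the same for any sequence length or layer count under these assumptions.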
134
 
135
+ ## 📊 Use Cases
 
 
 
136
 
137
+ - **Document Analysis**: Extract data from invoices, receipts, forms
138
+ - **Visual QA Systems**: Build AI that answers questions about images
139
+ - **Content Moderation**: Analyze images with contextual understanding
140
+ - **Educational Tools**: Explain diagrams, charts, and scientific images
141
+ - **Accessibility**: Generate detailed image descriptions for visually impaired users
142
+ - **E-commerce**: Product analysis and description generation
143
+ - **Medical Imaging**: Assist with image interpretation (not diagnostic)
144
 
145
+ ## ⚙️ Performance Tips
 
 
146
 
147
+ - **Temperature**: Use 0.6-0.8 for factual tasks, 0.8-1.0 for creative content
148
+ - **Context Window**: For >32K tokens, ensure 24GB+ VRAM
149
+ - **Batch Processing**: Process multiple images/texts together for efficiency
150
+ - **Quantization**: Use 4-bit/8-bit quantization for lower memory footprint
151
+ - **GPU Requirements**: Minimum 12GB VRAM (16GB recommended)
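
To put the quantization tip in numbers, the rough estimate below computes the memory taken by the weights alone at different precisions. It is a back-of-the-envelope sketch that assumes exactly 2 billion parameters. It ignores activations, the KV cache, and any quantization overhead, so real usage will be higher. `weight_memory_gb` is a hypothetical helper name.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed for the weights only, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 2e9  # assumed: the README gives "~2 Billion"

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under this assumption the BFloat16 weights come to about 4 GB, so most of the VRAM figures quoted in the tips would be going to activations, image features, and the KV cache rather than to the weights themselves.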
152
 
153
+ ## 🚨 Limitations
 
 
154
 
155
+ - 2B parameters may struggle with highly complex reasoning vs larger models
156
+ - Vision encoder optimized for natural images (not specialized medical/satellite imagery)
157
+ - Long context (>100K tokens) requires significant GPU memory
158
+ - Not fine-tuned for specific domains without additional training
159
 
160
+ ## 🀝 Model Comparison
161
 
+ | Model | Params | Context | Multimodal | Speed |
+ |-------|--------|---------|------------|-------|
+ | **Rax 4.5** | 2B | 262K | ✅ | ⚡⚡⚡ |
+ | LLaVA 1.5 | 7B | 4K | ✅ | ⚡⚡ |
+ | GPT-4V | - | 128K | ✅ | ⚡ |
+ | Qwen-VL | 7B | 32K | ✅ | ⚡⚡ |
168
 
169
+ ## 📖 Citation
170
+
+ ```bibtex
+ @misc{rax4.5,
+   title={Rax 4.5: Efficient Multimodal Vision-Language Model},
    author={Raxcore},
+   year={2026},
+   url={https://huggingface.co/raxcore-dev/rax-3.5-chat}
  }
+ ```
179
+
180
+ ## 📄 License
181
+
182
+ Apache 2.0 - Free for commercial and research use
183
+
184
+ ## 🔗 Links
185
+
186
+ - [Model Card](https://huggingface.co/raxcore-dev/rax-3.5-chat)
187
+ - [Raxcore GitHub](https://github.com/raxcore-dev)
188
+
189
+ ---
190
+
191
+ **Keywords**: vision language model, multimodal AI, image to text, VLM, computer vision, transformers, efficient LLM, 2B parameters, long context, production AI, visual question answering, image understanding, open source AI model