---
library_name: transformers
license: apache-2.0
pipeline_tag: image-text-to-text
tags:
- multimodal
- vision-language
- chat
---

# Rax 3.5 Chat

Rax 3.5 Chat is a compact 2B-parameter multimodal model for vision-language understanding and conversational AI. It accepts text and image inputs and supports a context window of up to 262,144 tokens (262K).

## Model Details

- **Parameters**: ~2B
- **Context Length**: 262,144 tokens
- **Input Modalities**: Text + Images
- **Attention**: Hybrid linear + full attention (24 layers)
- **Vision Encoder**: 24-layer transformer with 1024 hidden size
- **Text Hidden Size**: 2048
- **Precision**: BFloat16

## Key Features

- **Multimodal Understanding**: Processes text and images in unified reasoning
- **Long Context**: Supports up to 262K tokens for extended conversations
- **Efficient Architecture**: Hybrid linear + full attention to reduce compute and memory on long sequences
- **Production Ready**: Compatible with vLLM, SGLang, and Transformers

## Usage

### With Transformers

```python
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image

model_id = "raxcore/Rax-3.5-Chat"
# trust_remote_code=True runs the model code shipped in the repository
model = AutoModelForVision2Seq.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Text-only conversation
messages = [{"role": "user", "content": "What is the capital of France?"}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the echoed prompt
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(processor.decode(new_tokens, skip_special_tokens=True))

# With image: the {"type": "image"} entry marks where the image goes in the prompt
image = Image.open("image.jpg").convert("RGB")
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Describe this image."}]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(processor.decode(new_tokens, skip_special_tokens=True))
```

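### With vLLM

Start an OpenAI-compatible server, then query it with the OpenAI client. `--max-model-len 8192` caps the context at 8,192 tokens for this example; raise it for longer inputs if memory allows.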

```bash
vllm serve raxcore/Rax-3.5-Chat --port 8000 --max-model-len 8192
```

```python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")

response = client.chat.completions.create(
    model="raxcore/Rax-3.5-Chat",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
    max_tokens=512
)
print(response.choices[0].message.content)
```
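The request above is text-only. Since the model also takes images, here is a minimal sketch of an image request against the same server. It assumes the served model accepts the OpenAI-style `image_url` content part with an inline base64 data URL; `build_image_message` and `describe_image` are hypothetical helper names used only for illustration.

```python
import base64
from pathlib import Path


def build_image_message(image_path: str, prompt: str, mime_type: str = "image/jpeg") -> dict:
    """Build one user message holding an inline base64 image and a text prompt."""
    encoded = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    return {
        "role": "user",
        "content": [
            # Assumption: the server accepts OpenAI-style image_url parts with data URLs
            {"type": "image_url", "image_url": {"url": f"data:{mime_type};base64,{encoded}"}},
            {"type": "text", "text": prompt},
        ],
    }


def describe_image(image_path: str, prompt: str = "Describe this image.") -> str:
    """Send the image to the local server started above and return the reply text."""
    from openai import OpenAI  # imported here so build_image_message works without it

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")
    response = client.chat.completions.create(
        model="raxcore/Rax-3.5-Chat",
        messages=[build_image_message(image_path, prompt)],
        max_tokens=512,
    )
    return response.choices[0].message.content
```

With the server running, `print(describe_image("image.jpg"))` would send the file and print the reply. Keeping the payload builder separate from the network call makes it easy to inspect the message before sending it.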

## Architecture Highlights

- **Hybrid Attention**: Alternates between linear attention and full attention layers for efficiency
- **Vision Encoder**: 24-layer transformer with patch size 16 and spatial merge 2x2
- **Efficient KV Cache**: 2 key-value heads for reduced memory footprint
- **Multi-resolution Position Embeddings**: Optimized for long-context understanding
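The patch size and spatial merge listed above give a rough idea of how many tokens one image adds to the context. The sketch below is back-of-the-envelope arithmetic only: it assumes the image is fed at the given resolution with no resizing by the processor, that each side is a multiple of 32 pixels, and it ignores any special tokens wrapped around the image. `vision_token_count` is a hypothetical helper, not part of the model's API.

```python
def vision_token_count(height: int, width: int, patch_size: int = 16, merge: int = 2) -> int:
    """Estimate how many tokens one image contributes after 2x2 spatial merging."""
    block = patch_size * merge  # 32 pixels: one merged token covers a 32x32 region
    if height % block or width % block:
        raise ValueError(f"height and width must be multiples of {block}")
    patches = (height // patch_size) * (width // patch_size)  # 16x16 patches
    return patches // (merge * merge)  # each 2x2 group of patches becomes one token


print(vision_token_count(448, 448))   # 28 * 28 patches / 4 -> 196
print(vision_token_count(768, 1024))  # 48 * 64 patches / 4 -> 768
```

Under these assumptions, doubling both image dimensions quadruples the token count, which matters when budgeting context for several images.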

## Best Practices

- Use temperature 0.6–0.8 for factual tasks, 0.8–1.0 for creative tasks
- For long context (>32K tokens), ensure sufficient GPU memory
- Pass `trust_remote_code=True` when loading the model and processor with Transformers
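To make the long-context memory advice concrete, the sketch below estimates the KV cache size for the full-attention layers. Only the 2 key-value heads and BFloat16 precision come from this card; the number of full-attention layers and the head dimension are not stated here, so the values in the example call are placeholders to replace with the real ones from the model's config. Linear-attention layers are assumed to keep a fixed-size state and are left out, as are weights and activations.

```python
def kv_cache_bytes(
    seq_len: int,
    num_full_attn_layers: int,  # not stated in this card; read it from the model config
    head_dim: int,              # not stated in this card; read it from the model config
    num_kv_heads: int = 2,      # from the card: 2 key-value heads
    bytes_per_value: int = 2,   # BFloat16 stores each value in 2 bytes
) -> int:
    """Estimate KV cache size in bytes for one sequence (full-attention layers only)."""
    # The leading factor 2 accounts for storing both keys and values
    return 2 * num_full_attn_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Placeholder values: 6 full-attention layers and head_dim 256 are illustrative only
size = kv_cache_bytes(seq_len=262_144, num_full_attn_layers=6, head_dim=256)
print(f"{size / 1024**3:.1f} GiB")  # 3.0 GiB for these placeholder values
```

The estimate scales linearly with sequence length, so halving the context halves the cache for a given configuration.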

## Limitations

- 2B parameters may limit complex reasoning compared to larger models
- Vision understanding optimized for natural images
- Long context requires significant memory resources

## License

Apache 2.0

## Citation

```bibtex
@misc{rax3.5chat,
  title={Rax 3.5 Chat: Efficient Multimodal Assistant Model},
  author={Raxcore},
  year={2026}
}
```