Instructions for using raxcore-dev/Rax-4.5 with libraries, inference providers, notebooks, and local apps.

Libraries

How to use raxcore-dev/Rax-4.5 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="raxcore-dev/Rax-4.5")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
# The pipeline returns its results as a Python object; print it to inspect the reply
print(pipe(text=messages))

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("raxcore-dev/Rax-4.5")
model = AutoModelForMultimodalLM.from_pretrained("raxcore-dev/Rax-4.5")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
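The slice in the final `print` call above keeps only the newly generated tokens. For decoder-only generation, `generate` returns the prompt tokens followed by the new ones, so indexing from the prompt length drops the echoed prompt. A minimal sketch of that indexing, using plain Python lists in place of tensors; the token ids are made up for illustration:

```python
# Hypothetical token ids standing in for inputs["input_ids"][0]
prompt_ids = [101, 7, 8, 9]

# Stand-in for outputs[0]: the prompt followed by three generated tokens
generated = prompt_ids + [42, 43, 2]

# Same idea as outputs[0][inputs["input_ids"].shape[-1]:]
prompt_len = len(prompt_ids)
new_tokens = generated[prompt_len:]

print(new_tokens)
```

Decoding `new_tokens` instead of the full sequence avoids repeating the question in the printed answer.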

Local Apps

vLLM

How to use raxcore-dev/Rax-4.5 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "raxcore-dev/Rax-4.5"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "raxcore-dev/Rax-4.5",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
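The same request can be assembled from Python. The sketch below uses only the standard library and mirrors the JSON body of the curl call above; `build_chat_request` is a hypothetical helper name, and the request is built but not sent, since sending it assumes a server is already listening on the chosen port.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, image_url):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "http://localhost:8000",
    "raxcore-dev/Rax-4.5",
    "Describe this image in one sentence.",
    "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
)

# To send it, a server must be running at base_url:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["choices"][0]["message"]["content"])
```

For an SGLang server started with `--port 30000`, pass `"http://localhost:30000"` as the base URL instead.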

Use Docker

docker model run hf.co/raxcore-dev/Rax-4.5

SGLang

How to use raxcore-dev/Rax-4.5 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "raxcore-dev/Rax-4.5" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "raxcore-dev/Rax-4.5",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "raxcore-dev/Rax-4.5" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "raxcore-dev/Rax-4.5",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use raxcore-dev/Rax-4.5 with Docker Model Runner:
```
docker model run hf.co/raxcore-dev/Rax-4.5
```

raxder-ai committed 9 days ago

Commit 04d684a (verified) · 1 Parent(s): 4465e83

Upload README.md with huggingface_hub

Files changed (1)

README.md +44 -50

@@ -20,17 +20,17 @@ tags:
 inference: true
 ---
-# Rax 4.5 - Efficient 2B Vision Language Model | Multimodal AI
-**Rax 4.5** is a state-of-the-art 2 billion parameter multimodal vision-language model optimized for production use. Process images and text together with up to 262K token context length.
-## 🚀 Why Rax 4.5?
-- **⚡ Fast & Efficient**: Only 2B parameters for quick inference
-- **🖼️ Vision + Text**: True multimodal understanding of images and language
-- **📏 Long Context**: 262,144 token context window for complex tasks
-- **🔧 Production Ready**: Works with vLLM, SGLang, Transformers out of the box
-- **💾 Memory Efficient**: Hybrid attention architecture reduces VRAM usage
 ## Model Specifications
@@ -45,13 +45,13 @@ inference: true
 | **Precision** | BFloat16 |
 | **License** | Apache 2.0 |
-## 🔥 Key Capabilities
-✅ **Image Understanding** - Analyze, describe, and answer questions about images
-✅ **Visual Question Answering** - Extract information from screenshots, documents, charts
-✅ **Multimodal Reasoning** - Combine visual and textual information for complex tasks
-✅ **Long Context Processing** - Handle extensive documents with visual elements
-✅ **Production Deployment** - Optimized for real-world applications
 ## Quick Start
@@ -99,10 +99,9 @@ outputs = model.generate(**inputs, max_new_tokens=512)
 print(processor.decode(outputs[0], skip_special_tokens=True))
 \`\`\`
-### Deploy with vLLM (High Performance)
 \`\`\`bash
-# Start vLLM server
 vllm serve raxcore-dev/rax-3.5-chat --port 8000 --max-model-len 8192
 \`\`\`
@@ -124,49 +123,49 @@ response = client.chat.completions.create(
 print(response.choices[0].message.content)
 \`\`\`
-## 🏗️ Architecture Details
-- **Hybrid Attention Mechanism**: Alternates between linear and full attention for efficiency
-- **Vision Transformer**: 24-layer encoder with 16x16 patch size, 2x2 spatial merging
-- **Optimized KV Cache**: 2 key-value heads for 75% memory reduction
-- **Multi-Resolution Position Embeddings**: Handles various image sizes and long sequences
-- **Cross-Modal Fusion**: Advanced alignment between vision and language representations
-## 📊 Use Cases
-- **Document Analysis**: Extract data from invoices, receipts, forms
-- **Visual QA Systems**: Build AI that answers questions about images
-- **Content Moderation**: Analyze images with contextual understanding
-- **Educational Tools**: Explain diagrams, charts, and scientific images
-- **Accessibility**: Generate detailed image descriptions for visually impaired users
-- **E-commerce**: Product analysis and description generation
-- **Medical Imaging**: Assist with image interpretation (not diagnostic)
-## ⚙️ Performance Tips
-- **Temperature**: Use 0.6-0.8 for factual tasks, 0.8-1.0 for creative content
-- **Context Window**: For >32K tokens, ensure 24GB+ VRAM
-- **Batch Processing**: Process multiple images/texts together for efficiency
-- **Quantization**: Use 4-bit/8-bit quantization for lower memory footprint
-- **GPU Requirements**: Minimum 12GB VRAM (16GB recommended)
-## 🚨 Limitations
 - 2B parameters may struggle with highly complex reasoning vs larger models
 - Vision encoder optimized for natural images (not specialized medical/satellite imagery)
 - Long context (>100K tokens) requires significant GPU memory
 - Not fine-tuned for specific domains without additional training
-## 🤝 Model Comparison
 | Model | Params | Context | Multimodal | Speed |
 |-------|--------|---------|------------|-------|
-| **Rax 4.5** | 2B | 262K | ✅ | ⚡⚡⚡ |
-| LLaVA 1.5 | 7B | 4K | ✅ | ⚡⚡ |
-| GPT-4V | - | 128K | ✅ | ⚡ |
-| Qwen-VL | 7B | 32K | ✅ | ⚡⚡ |
-## 📖 Citation
 \`\`\`bibtex
 @misc{rax4.5,
@@ -177,15 +176,10 @@ print(response.choices[0].message.content)
 }
 \`\`\`
-## 📄 License
 Apache 2.0 - Free for commercial and research use
-## 🔗 Links
-- [Model Card](https://huggingface.co/raxcore-dev/rax-3.5-chat)
-- [Raxcore GitHub](https://github.com/raxcore-dev)
 ---
-**Keywords**: vision language model, multimodal AI, image to text, VLM, computer vision, transformers, efficient LLM, 2B parameters, long context, production AI, visual question answering, image understanding, open source AI model

 inference: true
 ---
+# Rax 4.5 - Efficient 2B Vision Language Model
+Rax 4.5 is a state-of-the-art 2 billion parameter multimodal vision-language model optimized for production use. Process images and text together with up to 262K token context length.
+## Key Features
+- Fast & Efficient: Only 2B parameters for quick inference
+- Vision + Text: True multimodal understanding of images and language
+- Long Context: 262,144 token context window for complex tasks
+- Production Ready: Works with vLLM, SGLang, Transformers out of the box
+- Memory Efficient: Hybrid attention architecture reduces VRAM usage
 ## Model Specifications
 | **Precision** | BFloat16 |
 | **License** | Apache 2.0 |
+## Capabilities
+- Image Understanding: Analyze, describe, and answer questions about images
+- Visual Question Answering: Extract information from screenshots, documents, charts
+- Multimodal Reasoning: Combine visual and textual information for complex tasks
+- Long Context Processing: Handle extensive documents with visual elements
+- Production Deployment: Optimized for real-world applications
 ## Quick Start
 print(processor.decode(outputs[0], skip_special_tokens=True))
 \`\`\`
+### Deploy with vLLM
 \`\`\`bash
 vllm serve raxcore-dev/rax-3.5-chat --port 8000 --max-model-len 8192
 \`\`\`
 print(response.choices[0].message.content)
 \`\`\`
+## Architecture Details
+- Hybrid Attention Mechanism: Alternates between linear and full attention for efficiency
+- Vision Transformer: 24-layer encoder with 16x16 patch size, 2x2 spatial merging
+- Optimized KV Cache: 2 key-value heads for 75% memory reduction
+- Multi-Resolution Position Embeddings: Handles various image sizes and long sequences
+- Cross-Modal Fusion: Advanced alignment between vision and language representations
+## Use Cases
+- Document Analysis: Extract data from invoices, receipts, forms
+- Visual QA Systems: Build AI that answers questions about images
+- Content Moderation: Analyze images with contextual understanding
+- Educational Tools: Explain diagrams, charts, and scientific images
+- Accessibility: Generate detailed image descriptions for visually impaired users
+- E-commerce: Product analysis and description generation
+- Medical Imaging: Assist with image interpretation (not diagnostic)
+## Performance Tips
+- Temperature: Use 0.6-0.8 for factual tasks, 0.8-1.0 for creative content
+- Context Window: For >32K tokens, ensure 24GB+ VRAM
+- Batch Processing: Process multiple images/texts together for efficiency
+- Quantization: Use 4-bit/8-bit quantization for lower memory footprint
+- GPU Requirements: Minimum 12GB VRAM (16GB recommended)
+## Limitations
 - 2B parameters may struggle with highly complex reasoning vs larger models
 - Vision encoder optimized for natural images (not specialized medical/satellite imagery)
 - Long context (>100K tokens) requires significant GPU memory
 - Not fine-tuned for specific domains without additional training
+## Model Comparison
 | Model | Params | Context | Multimodal | Speed |
 |-------|--------|---------|------------|-------|
+| Rax 4.5 | 2B | 262K | Yes | Fast |
+| LLaVA 1.5 | 7B | 4K | Yes | Medium |
+| GPT-4V | - | 128K | Yes | Slow |
+| Qwen-VL | 7B | 32K | Yes | Medium |
+## Citation
 \`\`\`bibtex
 @misc{rax4.5,
 }
 \`\`\`
+## License
 Apache 2.0 - Free for commercial and research use
 ---
+Keywords: vision language model, multimodal AI, image to text, VLM, computer vision, transformers, efficient LLM, 2B parameters, long context, production AI, visual question answering, image understanding, open source AI model
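
The README's architecture bullets give a 16x16 patch size and 2x2 spatial merging for the vision encoder. Taken at face value, those two numbers give a rough visual token count per image. The sketch below works through that arithmetic; it assumes the image has already been resized so both sides are multiples of 32 and ignores any special tokens the processor may add, so the model's real preprocessing may produce different counts. `image_token_count` is a hypothetical helper, not part of any library.

```python
def image_token_count(height, width, patch=16, merge=2):
    """Estimate visual tokens for one image after patching and spatial merging."""
    block = patch * merge
    if height % block or width % block:
        raise ValueError(f"height and width must be multiples of {block}")
    # Number of patch x patch tiles covering the image
    patches = (height // patch) * (width // patch)
    # Each merge x merge group of patches collapses into one token
    return patches // (merge * merge)


# A 448x448 image: 28 * 28 = 784 patches, merged 4-to-1
print(image_token_count(448, 448))
```

Under these assumptions a 448x448 image costs 196 tokens, which is small next to the 262,144-token context window stated in the README.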