--- |
|
|
license: apache-2.0 |
|
|
datasets: |
|
|
- SouthpawIN/senter-omni-data
|
|
language: |
|
|
- en |
|
|
base_model: |
|
|
- Qwen/Qwen2.5-Omni-3B |
|
|
pipeline_tag: any-to-any |
|
|
--- |
|
|
<div align="center">

# Senter-Omni

🤗🤗
|
|
|
|
|
</div> |
|
|
|
|
|
**🎯 ONE MODEL, ALL MODALITIES, CHAT & EMBED** - Unlike pipeline approaches, Senter-Omni is a single 4B-parameter model that understands and reasons across text, images, audio, and video simultaneously.
|
|
|
|
|
**🔓 OPEN & UNCENSORED** - Apache 2.0 licensed with unrestricted responses for maximum utility.
|
|
|
|
|
**🧠 128K CONTEXT** - Extended RoPE scaling for handling massive documents and conversations.
|
|
|
|
|
**💾 MEMORY EFFICIENT** - 4-bit quantized model that fits on consumer GPUs while maintaining full multimodal capabilities.
|
|
|
|
|
--- |
|
|
|
|
|
|
|
## 🚀 **Quick Start**
|
|
|
|
|
### **Installation** |
|
|
```bash |
|
|
git clone https://github.com/SouthpawIN/senter-omni.git |
|
|
cd senter-omni |
|
|
pip install -r requirements.txt |
|
|
|
|
|
# Download the quantized model (instructions below) |
|
|
# Then run the demo: |
|
|
python senter_omni_demo.py |
|
|
``` |
|
|
|
|
|
### **Basic Usage** |
|
|
```python |
|
|
from omni import OmniClient |
|
|
|
|
|
# Initialize Senter-Omni |
|
|
client = OmniClient() |
|
|
|
|
|
# Streaming chat |
|
|
response = client.chat([ |
|
|
{"role": "user", "content": "Hello Senter!"} |
|
|
], stream=True) |
|
|
|
|
|
# Multimodal chat with image |
|
|
response = client.chat([ |
|
|
{"role": "user", "content": [ |
|
|
{"type": "image", "image": "photo.jpg"}, |
|
|
{"type": "text", "text": "What do you see?"} |
|
|
]} |
|
|
]) |
|
|
|
|
|
# Cross-modal embeddings |
|
|
embedding = client.embed("any content", modality="auto") |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## 🎭 **Multimodal Capabilities**
|
|
|
|
|
### **Text Understanding & Generation** |
|
|
- **Mathematical Reasoning**: Step-by-step problem solving |
|
|
- **Code Generation**: Python, JavaScript, and more |
|
|
- **Creative Writing**: Stories, scripts, poetry |
|
|
- **Technical Analysis**: Complex explanations and documentation |
|
|
|
|
|
### **Visual Understanding** |
|
|
- **Image Analysis**: Detailed descriptions of visual content |
|
|
- **Geometric Recognition**: Shapes, colors, spatial relationships |
|
|
- **Creative Interpretation**: Stories inspired by images |
|
|
- **Technical Diagrams**: Understanding charts, graphs, schematics |
|
|
|
|
|
### **Audio Processing** |
|
|
- **Sound Analysis**: Identifying audio content and patterns |
|
|
- **Speech Understanding**: Transcribing and interpreting spoken content |
|
|
- **Music Analysis**: Recognizing musical elements and genres |
|
|
- **Environmental Audio**: Identifying sounds from various sources |
|
|
|
|
|
### **Cross-Modal Reasoning** |
|
|
- **Unified Understanding**: Connecting information across modalities |
|
|
- **Contextual Analysis**: Using multiple inputs for better reasoning |
|
|
- **Creative Synthesis**: Combining visual, audio, and text for rich responses (see the combined request sketch below)
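
A combined request can mix several modalities in a single message. A minimal sketch using the same `OmniClient` API as the Quick Start (file names here are placeholders):

```python
from omni import OmniClient

client = OmniClient()

# One message carrying an image, an audio clip, and a text question
response = client.chat([
    {"role": "user", "content": [
        {"type": "image", "image": "scene.jpg"},
        {"type": "audio", "audio": "ambience.wav"},
        {"type": "text", "text": "Do this image and this audio come from the same place?"}
    ]}
])
```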
|
|
|
|
|
### **Model Specifications** |
|
|
- **Parameters**: 4B (quantized to 4-bit) |
|
|
- **Context Length**: 128K tokens (RoPE scaled) |
|
|
- **Memory Usage**: ~8GB VRAM |
|
|
- **Inference Speed**: Real-time streaming |
|
|
- **Modalities**: Text, Image, Audio, Video |
|
|
|
|
|
### **Embedding Capabilities** |
|
|
- **Unified Space**: 1024D embeddings for all modalities |
|
|
- **Cross-Modal Search**: Find similar content across text, images, audio |
|
|
- **Similarity Matching**: Cosine similarity in unified space |
|
|
- **Memory Efficient**: Same model for chat and embeddings |
|
|
|
|
|
--- |
|
|
|
|
|
## 🎯 **Real Examples**
|
|
|
|
|
### **Image Analysis** |
|
|
```python |
|
|
# Analyze geometric shapes |
|
|
response = client.chat([ |
|
|
{"role": "user", "content": [ |
|
|
{"type": "image", "image": "test_assets/real_test_image.jpg"}, |
|
|
{"type": "text", "text": "What geometric shapes do you see?"} |
|
|
]} |
|
|
]) |
|
|
|
|
|
# Output: "I see a red square, blue square, and green oval arranged vertically" |
|
|
``` |
|
|
|
|
|
### **Audio Understanding** |
|
|
```python |
|
|
# Process audio content |
|
|
response = client.chat([ |
|
|
{"role": "user", "content": [ |
|
|
{"type": "audio", "audio": "test_assets/real_test_audio.wav"}, |
|
|
{"type": "text", "text": "What do you hear?"} |
|
|
]} |
|
|
]) |
|
|
|
|
|
# Output: "I hear an electric hum from a device like a radio or TV" |
|
|
``` |
|
|
|
|
|
### **Creative Multimodal Storytelling** |
|
|
```python |
|
|
# Create stories from images |
|
|
response = client.chat([ |
|
|
{"role": "user", "content": [ |
|
|
{"type": "image", "image": "shapes.jpg"}, |
|
|
{"type": "text", "text": "Create a story inspired by this image"} |
|
|
]} |
|
|
]) |
|
|
|
|
|
# Output: Rich, creative stories combining visual elements with narrative |
|
|
``` |
|
|
|
|
|
### **Cross-Modal Embeddings** |
|
|
```python |
|
|
# Embed different modalities |
|
|
text_emb = client.embed("beautiful mountain landscape") |
|
|
image_emb = client.embed("mountain_photo.jpg", modality="image") |
|
|
audio_emb = client.embed("nature_sounds.wav", modality="audio") |
|
|
|
|
|
# All embeddings are in the same 1024D space for comparison |
|
|
``` |
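
Because all three vectors live in the same 1024D space, they can be compared directly with cosine similarity. A minimal sketch with NumPy, assuming `embed()` returns array-like vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two 1-D embedding vectors
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Related content should score higher than unrelated content
print(cosine_similarity(text_emb, image_emb))
print(cosine_similarity(text_emb, audio_emb))
```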
|
|
|
|
|
--- |
|
|
|
|
|
## 🔧 **Technical Architecture**
|
|
|
|
|
### **Model Details** |
|
|
- **Base**: Qwen2.5-Omni-3B (Apache 2.0 licensed) |
|
|
- **Quantization**: 4-bit NF4 for memory efficiency (see the loading sketch below)
|
|
- **Context Extension**: YaRN RoPE scaling to 128K
|
|
- **Streaming**: Custom TimingStreamer for real-time output |
|
|
- **Embeddings**: Hash-based unified 1024D space |
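
For orientation, a minimal loading sketch matching these details, assuming a recent `transformers` with Qwen2.5-Omni support plus `bitsandbytes`; the project's own loader may differ, and the 128K YaRN settings normally come from the checkpoint's `config.json`:

```python
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, Qwen2_5OmniForConditionalGeneration

# 4-bit NF4 quantization, as described above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "./senter_omni_128k",           # local model directory (see Installation & Setup)
    quantization_config=bnb_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("./senter_omni_128k")
```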
|
|
|
|
|
### **Training Data** |
|
|
- **131,893 samples** from multiple high-quality datasets (loadable with the snippet after this list):
|
|
- 50,000 ShareGPT conversations (chat) |
|
|
- 30,000 AgentCode samples (function calling) |
|
|
- 20,000 Stack Overflow (coding) |
|
|
- 30,000 Hermes-3 (instruction tuning) |
|
|
- 1,893 Hermes function calling |
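
The combined dataset is published on the Hugging Face Hub and can be loaded with the `datasets` library (the split name here is an assumption):

```python
from datasets import load_dataset

# Pull the combined 131,893-sample training set from the Hub
ds = load_dataset("SouthpawIN/senter-omni-data", split="train")
print(ds)
```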
|
|
|
|
|
### **Key Features** |
|
|
- **XML Tag Support**: `<think>`, `<notepad>`, `<system>`, `<user>`, `<assistant>` (see the sketch after this list)
|
|
- **Uncensored Responses**: No content restrictions |
|
|
- **Function Calling**: Tool integration capabilities |
|
|
- **Memory Efficient**: Single model for chat and embeddings |
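
As an illustration of the XML tags in practice (a sketch; the exact tag behavior depends on the training data above), a reasoning request may come back with intermediate work wrapped in `<think>` tags:

```python
response = client.chat([
    {"role": "user", "content": "Think step by step: what is 17 * 24?"}
])

# A possible tagged response:
# <think>17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408</think>
# 17 * 24 = 408
```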
|
|
|
|
|
--- |
|
|
|
|
|
## 📦 **Installation & Setup**
|
|
|
|
|
### **1. Clone Repository** |
|
|
```bash |
|
|
git clone https://github.com/SouthpawIN/senter-omni.git |
|
|
cd senter-omni |
|
|
``` |
|
|
|
|
|
### **2. Install Dependencies** |
|
|
```bash |
|
|
pip install -r requirements.txt |
|
|
``` |
|
|
|
|
|
### **3. Download Model** |
|
|
The quantized model (3.5GB) is hosted on Hugging Face due to GitHub's 100MB file limit: |
|
|
|
|
|
- **Model**: https://huggingface.co/SouthpawIN/senter-omni-model
- **Dataset**: https://huggingface.co/datasets/SouthpawIN/senter-omni-data
|
|
|
|
|
```bash |
|
|
# Option 1: Download from Hugging Face (Recommended) |
|
|
git lfs install |
|
|
git clone https://huggingface.co/SouthpawIN/senter-omni-model |
|
|
cp -r senter-omni-model/* ./senter_omni_128k/ |
|
|
|
|
|
# Option 2: Manual download |
|
|
# Download from: https://huggingface.co/SouthpawIN/senter-omni-model |
|
|
``` |
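
A Python alternative using `huggingface_hub` (same repo id as above):

```python
from huggingface_hub import snapshot_download

# Download the model repo into the directory the demo expects
snapshot_download(
    repo_id="SouthpawIN/senter-omni-model",
    local_dir="./senter_omni_128k",
)
```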
|
|
|
|
|
## 🎮 **Interactive Demo**
|
|
|
|
|
The comprehensive demo showcases all capabilities: |
|
|
|
|
|
```bash |
|
|
python senter_omni_demo.py |
|
|
``` |
|
|
|
|
|
**Demo Sections:** |
|
|
1. **📊 Training Capabilities** - Dataset overview and training features
|
|
2. **💬 Multimodal Chat** - Text, image, audio, and combined processing
|
|
3. **🔗 Cross-Modal Embeddings** - Unified embedding space demonstration
|
|
4. **📖 Building Guide** - API usage and integration examples
|
|
|
|
|
--- |
|
|
|
|
|
## 🛠️ **API Reference**
|
|
|
|
|
### **Core Methods** |
|
|
|
|
|
#### **`client.chat(messages, **kwargs)`** |
|
|
```python |
|
|
# Basic chat |
|
|
response = client.chat([ |
|
|
{"role": "user", "content": "Hello!"} |
|
|
]) |
|
|
|
|
|
# With parameters |
|
|
response = client.chat( |
|
|
messages=[{"role": "user", "content": "Hello!"}], |
|
|
max_tokens=256, |
|
|
temperature=0.7, |
|
|
stream=True |
|
|
) |
|
|
|
|
|
# Multimodal |
|
|
response = client.chat([ |
|
|
{"role": "user", "content": [ |
|
|
{"type": "image", "image": "photo.jpg"}, |
|
|
{"type": "text", "text": "Describe this image"} |
|
|
]} |
|
|
]) |
|
|
``` |
|
|
|
|
|
#### **`client.embed(content, modality="auto")`** |
|
|
```python |
|
|
# Text embedding |
|
|
emb = client.embed("sample text") |
|
|
|
|
|
# Image embedding |
|
|
emb = client.embed("image.jpg", modality="image") |
|
|
|
|
|
# Audio embedding |
|
|
emb = client.embed("audio.wav", modality="audio") |
|
|
|
|
|
# Auto-detect modality |
|
|
emb = client.embed("[IMAGE] photo.jpg") # Detects as image |
|
|
``` |
|
|
|
|
|
#### **`client.cross_search(query, top_k=5)`** |
|
|
```python |
|
|
# Search across modalities |
|
|
results = client.cross_search("mountain landscape") |
|
|
# Returns: {"text": [...], "image": [...], "audio": [...]} |
|
|
``` |
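
A short usage sketch iterating the per-modality result lists (the exact item schema is whatever `cross_search` returns):

```python
results = client.cross_search("mountain landscape", top_k=3)

for modality, items in results.items():
    print(f"{modality}: {len(items)} matches")
    for item in items:
        print("  ", item)
```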
|
|
|
|
|
#### **`client.retrieve_context(query, context_window=5)`** |
|
|
```python |
|
|
# Get relevant context |
|
|
context = client.retrieve_context("nature scenes") |
|
|
# Returns multimodal context items |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
### **Memory Usage** |
|
|
- **Model Loading**: ~8GB VRAM |
|
|
- **Inference**: ~10GB VRAM peak (see the measurement snippet below)
|
|
- **Embeddings**: Shared model (no additional memory) |
|
|
- **Context (128K)**: ~2GB additional for full context |
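
To check these numbers on your own hardware, peak VRAM can be measured with plain PyTorch (a generic snippet, not specific to this project):

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... load the model and run an inference here ...
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak VRAM: {peak_gb:.1f} GB")
```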
|
|
|
|
|
### **Development Setup** |
|
|
```bash |
|
|
git clone https://github.com/SouthpawIN/senter-omni.git |
|
|
cd senter-omni |
|
|
pip install -r requirements.txt |
|
|
python senter_omni_demo.py # Test installation |
|
|
``` |
|
|
--- |
|
|
|
|
|
## 📄 **License**
|
|
|
|
|
**Apache 2.0 License** - See [LICENSE](LICENSE) for details. |
|
|
|
|
|
This project uses: |
|
|
- **Qwen2.5-Omni**: Apache 2.0 (Alibaba Cloud) |
|
|
- **Training Datasets**: Various open licenses |
|
|
- **Code**: Apache 2.0 |
|
|
|
|
|
--- |
|
|
|
|
|
## 🙏 **Acknowledgments**
|
|
|
|
|
- **Alibaba Cloud** for Qwen2.5-Omni architecture |
|
|
- **Nous Research** for Hermes dataset and inspiration |
|
|
- **Alignment Lab AI** for development and training |
|
|
- **Unsloth** for efficient training framework |
|
|
- **HuggingFace** for model hosting and tools |
|
|
- **Open Source Community** for datasets and tools |
|
|
--- |
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
**🎭 EXPERIENCE THE FUTURE OF MULTIMODAL AI WITH SENTER-OMNI**
|
|
|
|
|
*Built with ❤️ by sovthpaw at Alignment Lab AI*
|
|
|
|
|
Donations: |
|
|
|
|
|
https://www.paypal.me/Sellgames1l |
|
|
</div> |