--- |
|
|
license: apache-2.0 |
|
|
datasets: |
|
|
- SouthpawIN/senter-omni-data
|
|
language: |
|
|
- en |
|
|
base_model: |
|
|
- Qwen/Qwen2.5-Omni-3B |
|
|
pipeline_tag: any-to-any |
|
|
--- |
|
|
<div align="center">

# Senter-Omni

🤗🤗
|
|
|
|
|
</div> |
|
|
|
|
|
**🎯 ONE MODEL, ALL MODALITIES, CHAT & EMBED** - Unlike pipeline approaches, Senter-Omni is a single 4B-parameter model that understands and reasons across text, images, audio, and video simultaneously.
|
|
|
|
|
**🔓 OPEN & UNCENSORED** - Apache 2.0 licensed with unrestricted responses for maximum utility.
|
|
|
|
|
**🧠 128K CONTEXT** - Extended RoPE scaling for handling massive documents and conversations.
|
|
|
|
|
**💾 MEMORY EFFICIENT** - 4-bit quantized model that fits on consumer GPUs while maintaining full multimodal capabilities.
|
|
|
|
|
--- |
|
|
|
|
|
|
|
## 🚀 **Quick Start**
|
|
|
|
|
### **Installation** |
|
|
```bash |
|
|
git clone https://github.com/SouthpawIN/senter-omni.git |
|
|
cd senter-omni |
|
|
pip install -r requirements.txt |
|
|
|
|
|
# Download the quantized model (instructions below) |
|
|
# Then run the demo: |
|
|
python senter_omni_demo.py |
|
|
``` |
|
|
|
|
|
### **Basic Usage** |
|
|
```python |
|
|
from omni import OmniClient |
|
|
|
|
|
# Initialize Senter-Omni |
|
|
client = OmniClient() |
|
|
|
|
|
# Streaming chat |
|
|
response = client.chat([ |
|
|
{"role": "user", "content": "Hello Senter!"} |
|
|
], stream=True) |
|
|
|
|
|
# Multimodal chat with image |
|
|
response = client.chat([ |
|
|
{"role": "user", "content": [ |
|
|
{"type": "image", "image": "photo.jpg"}, |
|
|
{"type": "text", "text": "What do you see?"} |
|
|
]} |
|
|
]) |
|
|
|
|
|
# Cross-modal embeddings |
|
|
embedding = client.embed("any content", modality="auto") |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## 🎭 **Multimodal Capabilities**
|
|
|
|
|
### **Text Understanding & Generation** |
|
|
- **Mathematical Reasoning**: Step-by-step problem solving |
|
|
- **Code Generation**: Python, JavaScript, and more |
|
|
- **Creative Writing**: Stories, scripts, poetry |
|
|
- **Technical Analysis**: Complex explanations and documentation |
|
|
|
|
|
### **Visual Understanding** |
|
|
- **Image Analysis**: Detailed descriptions of visual content |
|
|
- **Geometric Recognition**: Shapes, colors, spatial relationships |
|
|
- **Creative Interpretation**: Stories inspired by images |
|
|
- **Technical Diagrams**: Understanding charts, graphs, schematics |
|
|
|
|
|
### **Audio Processing** |
|
|
- **Sound Analysis**: Identifying audio content and patterns |
|
|
- **Speech Understanding**: Transcribing and interpreting spoken content |
|
|
- **Music Analysis**: Recognizing musical elements and genres |
|
|
- **Environmental Audio**: Identifying sounds from various sources |
|
|
|
|
|
### **Cross-Modal Reasoning** |
|
|
- **Unified Understanding**: Connecting information across modalities |
|
|
- **Contextual Analysis**: Using multiple inputs for better reasoning |
|
|
- **Creative Synthesis**: Combining visual, audio, and text for rich responses (see the combined request sketch below)
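
A combined request can mix several modalities in a single message. A minimal sketch using the same `OmniClient` API as the Quick Start (file names here are placeholders):

```python
from omni import OmniClient

client = OmniClient()

# One message carrying an image, an audio clip, and a text question
response = client.chat([
    {"role": "user", "content": [
        {"type": "image", "image": "scene.jpg"},
        {"type": "audio", "audio": "ambience.wav"},
        {"type": "text", "text": "Do this image and this audio come from the same place?"}
    ]}
])
```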
|
|
|
|
|
### **Model Specifications** |
|
|
- **Parameters**: 4B (quantized to 4-bit) |
|
|
- **Context Length**: 128K tokens (RoPE scaled) |
|
|
- **Memory Usage**: ~8GB VRAM |
|
|
- **Inference Speed**: Real-time streaming |
|
|
- **Modalities**: Text, Image, Audio, Video |
|
|
|
|
|
### **Embedding Capabilities** |
|
|
- **Unified Space**: 1024D embeddings for all modalities |
|
|
- **Cross-Modal Search**: Find similar content across text, images, audio |
|
|
- **Similarity Matching**: Cosine similarity in unified space |
|
|
- **Memory Efficient**: Same model for chat and embeddings |
|
|
|
|
|
--- |
|
|
|
|
|
## 🎯 **Real Examples**
|
|
|
|
|
### **Image Analysis** |
|
|
```python |
|
|
# Analyze geometric shapes |
|
|
response = client.chat([ |
|
|
{"role": "user", "content": [ |
|
|
{"type": "image", "image": "test_assets/real_test_image.jpg"}, |
|
|
{"type": "text", "text": "What geometric shapes do you see?"} |
|
|
]} |
|
|
]) |
|
|
|
|
|
# Output: "I see a red square, blue square, and green oval arranged vertically" |
|
|
``` |
|
|
|
|
|
### **Audio Understanding** |
|
|
```python |
|
|
# Process audio content |
|
|
response = client.chat([ |
|
|
{"role": "user", "content": [ |
|
|
{"type": "audio", "audio": "test_assets/real_test_audio.wav"}, |
|
|
{"type": "text", "text": "What do you hear?"} |
|
|
]} |
|
|
]) |
|
|
|
|
|
# Output: "I hear an electric hum from a device like a radio or TV" |
|
|
``` |
|
|
|
|
|
### **Creative Multimodal Storytelling** |
|
|
```python |
|
|
# Create stories from images |
|
|
response = client.chat([ |
|
|
{"role": "user", "content": [ |
|
|
{"type": "image", "image": "shapes.jpg"}, |
|
|
{"type": "text", "text": "Create a story inspired by this image"} |
|
|
]} |
|
|
]) |
|
|
|
|
|
# Output: Rich, creative stories combining visual elements with narrative |
|
|
``` |
|
|
|
|
|
### **Cross-Modal Embeddings** |
|
|
```python |
|
|
# Embed different modalities |
|
|
text_emb = client.embed("beautiful mountain landscape") |
|
|
image_emb = client.embed("mountain_photo.jpg", modality="image") |
|
|
audio_emb = client.embed("nature_sounds.wav", modality="audio") |
|
|
|
|
|
# All embeddings are in the same 1024D space for comparison |
|
|
``` |
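
Because all three vectors live in the same 1024D space, they can be compared directly with cosine similarity. A minimal sketch with NumPy, assuming `embed()` returns array-like vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two 1-D embedding vectors
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Related content should score higher than unrelated content
print(cosine_similarity(text_emb, image_emb))
print(cosine_similarity(text_emb, audio_emb))
```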
|
|
|
|
|
--- |
|
|
|
|
|
## 🔧 **Technical Architecture**
|
|
|
|
|
### **Model Details** |
|
|
- **Base**: Qwen2.5-Omni-3B (Apache 2.0 licensed) |
|
|
- **Quantization**: 4-bit NF4 for memory efficiency (see the loading sketch below)
|
|
- **Context Extension**: YaRN RoPE scaling to 128K
|
|
- **Streaming**: Custom TimingStreamer for real-time output |
|
|
- **Embeddings**: Hash-based unified 1024D space |
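
For orientation, a minimal loading sketch matching these details, assuming a recent `transformers` with Qwen2.5-Omni support plus `bitsandbytes`; the project's own loader may differ, and the 128K YaRN settings normally come from the checkpoint's `config.json`:

```python
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, Qwen2_5OmniForConditionalGeneration

# 4-bit NF4 quantization, as described above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "./senter_omni_128k",           # local model directory (see Installation & Setup)
    quantization_config=bnb_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("./senter_omni_128k")
```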
|
|
|
|
|
### **Training Data** |
|
|
- **131,893 samples** from multiple high-quality datasets (loadable with the snippet after this list):
|
|
- 50,000 ShareGPT conversations (chat) |
|
|
- 30,000 AgentCode samples (function calling) |
|
|
- 20,000 Stack Overflow (coding) |
|
|
- 30,000 Hermes-3 (instruction tuning) |
|
|
- 1,893 Hermes function calling |
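
The combined dataset is published on the Hugging Face Hub and can be loaded with the `datasets` library (the split name here is an assumption):

```python
from datasets import load_dataset

# Pull the combined 131,893-sample training set from the Hub
ds = load_dataset("SouthpawIN/senter-omni-data", split="train")
print(ds)
```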
|
|
|
|
|
### **Key Features** |
|
|
- **XML Tag Support**: `<think>`, `<notepad>`, `<system>`, `<user>`, `<assistant>` (see the sketch after this list)
|
|
- **Uncensored Responses**: No content restrictions |
|
|
- **Function Calling**: Tool integration capabilities |
|
|
- **Memory Efficient**: Single model for chat and embeddings |
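
As an illustration of the XML tags in practice (a sketch; the exact tag behavior depends on the training data above), a reasoning request may come back with intermediate work wrapped in `<think>` tags:

```python
response = client.chat([
    {"role": "user", "content": "Think step by step: what is 17 * 24?"}
])

# A possible tagged response:
# <think>17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408</think>
# 17 * 24 = 408
```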
|
|
|
|
|
--- |
|
|
|
|
|
## 📦 **Installation & Setup**
|
|
|
|
|
### **1. Clone Repository** |
|
|
```bash |
|
|
git clone https://github.com/SouthpawIN/senter-omni.git |
|
|
cd senter-omni |
|
|
``` |
|
|
|
|
|
### **2. Install Dependencies** |
|
|
```bash |
|
|
pip install -r requirements.txt |
|
|
``` |
|
|
|
|
|
### **3. Download Model** |
|
|
The quantized model (3.5GB) is hosted on Hugging Face due to GitHub's 100MB file limit: |
|
|
|
|
|
- **Model**: https://huggingface.co/SouthpawIN/senter-omni-model
- **Dataset**: https://huggingface.co/datasets/SouthpawIN/senter-omni-data
|
|
|
|
|
```bash |
|
|
# Option 1: Download from Hugging Face (Recommended) |
|
|
git lfs install |
|
|
git clone https://huggingface.co/SouthpawIN/senter-omni-model |
|
|
cp -r senter-omni-model/* ./senter_omni_128k/ |
|
|
|
|
|
# Option 2: Manual download |
|
|
# Download from: https://huggingface.co/SouthpawIN/senter-omni-model |
|
|
``` |
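
A Python alternative using `huggingface_hub` (same repo id as above):

```python
from huggingface_hub import snapshot_download

# Download the model repo into the directory the demo expects
snapshot_download(
    repo_id="SouthpawIN/senter-omni-model",
    local_dir="./senter_omni_128k",
)
```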
|
|
|
|
|
## 🎮 **Interactive Demo**
|
|
|
|
|
The comprehensive demo showcases all capabilities: |
|
|
|
|
|
```bash |
|
|
python senter_omni_demo.py |
|
|
``` |
|
|
|
|
|
**Demo Sections:** |
|
|
1. **📊 Training Capabilities** - Dataset overview and training features
|
|
2. **💬 Multimodal Chat** - Text, image, audio, and combined processing
|
|
3. **🔗 Cross-Modal Embeddings** - Unified embedding space demonstration
|
|
4. **📖 Building Guide** - API usage and integration examples
|
|
|
|
|
--- |
|
|
|
|
|
## 🛠️ **API Reference**
|
|
|
|
|
### **Core Methods** |
|
|
|
|
|
#### **`client.chat(messages, **kwargs)`** |
|
|
```python |
|
|
# Basic chat |
|
|
response = client.chat([ |
|
|
{"role": "user", "content": "Hello!"} |
|
|
]) |
|
|
|
|
|
# With parameters |
|
|
response = client.chat( |
|
|
messages=[{"role": "user", "content": "Hello!"}], |
|
|
max_tokens=256, |
|
|
temperature=0.7, |
|
|
stream=True |
|
|
) |
|
|
|
|
|
# Multimodal |
|
|
response = client.chat([ |
|
|
{"role": "user", "content": [ |
|
|
{"type": "image", "image": "photo.jpg"}, |
|
|
{"type": "text", "text": "Describe this image"} |
|
|
]} |
|
|
]) |
|
|
``` |
|
|
|
|
|
#### **`client.embed(content, modality="auto")`** |
|
|
```python |
|
|
# Text embedding |
|
|
emb = client.embed("sample text") |
|
|
|
|
|
# Image embedding |
|
|
emb = client.embed("image.jpg", modality="image") |
|
|
|
|
|
# Audio embedding |
|
|
emb = client.embed("audio.wav", modality="audio") |
|
|
|
|
|
# Auto-detect modality |
|
|
emb = client.embed("[IMAGE] photo.jpg") # Detects as image |
|
|
``` |
|
|
|
|
|
#### **`client.cross_search(query, top_k=5)`** |
|
|
```python |
|
|
# Search across modalities |
|
|
results = client.cross_search("mountain landscape") |
|
|
# Returns: {"text": [...], "image": [...], "audio": [...]} |
|
|
``` |
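
A short usage sketch iterating the per-modality result lists (the exact item schema is whatever `cross_search` returns):

```python
results = client.cross_search("mountain landscape", top_k=3)

for modality, items in results.items():
    print(f"{modality}: {len(items)} matches")
    for item in items:
        print("  ", item)
```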
|
|
|
|
|
#### **`client.retrieve_context(query, context_window=5)`** |
|
|
```python |
|
|
# Get relevant context |
|
|
context = client.retrieve_context("nature scenes") |
|
|
# Returns multimodal context items |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
### **Memory Usage** |
|
|
- **Model Loading**: ~8GB VRAM |
|
|
- **Inference**: ~10GB VRAM peak (see the measurement snippet below)
|
|
- **Embeddings**: Shared model (no additional memory) |
|
|
- **Context (128K)**: ~2GB additional for full context |
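
To check these numbers on your own hardware, peak VRAM can be measured with plain PyTorch (a generic snippet, not specific to this project):

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... load the model and run an inference here ...
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak VRAM: {peak_gb:.1f} GB")
```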
|
|
|
|
|
### **Development Setup** |
|
|
```bash |
|
|
git clone https://github.com/SouthpawIN/senter-omni.git |
|
|
cd senter-omni |
|
|
pip install -r requirements.txt |
|
|
python senter_omni_demo.py # Test installation |
|
|
``` |
|
|
--- |
|
|
|
|
|
## 📄 **License**
|
|
|
|
|
**Apache 2.0 License** - See [LICENSE](LICENSE) for details. |
|
|
|
|
|
This project uses: |
|
|
- **Qwen2.5-Omni**: Apache 2.0 (Alibaba Cloud) |
|
|
- **Training Datasets**: Various open licenses |
|
|
- **Code**: Apache 2.0 |
|
|
|
|
|
--- |
|
|
|
|
|
## 🙏 **Acknowledgments**
|
|
|
|
|
- **Alibaba Cloud** for Qwen2.5-Omni architecture |
|
|
- **Nous Research** for Hermes dataset and inspiration |
|
|
- **Alignment Lab AI** for development and training |
|
|
- **Unsloth** for efficient training framework |
|
|
- **HuggingFace** for model hosting and tools |
|
|
- **Open Source Community** for datasets and tools |
|
|
--- |
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
**🎭 EXPERIENCE THE FUTURE OF MULTIMODAL AI WITH SENTER-OMNI**
|
|
|
|
|
*Built with ❤️ by sovthpaw at Alignment Lab AI*
|
|
|
|
|
Donations: |
|
|
|
|
|
https://www.paypal.me/Sellgames1l |
|
|
</div> |