# SykoOmni V1.1-Beta
SykoOmni is a multimodal AI model that combines three models into a single `.safetensors` file.
Built by @SykoSLM.
## Architecture
| Module | Model | Task |
|---|---|---|
| 🧠 Text (Orchestrator) | Qwen2.5-0.5B-Instruct | Text generation, routing |
| 🎨 Image | SDXL-Turbo | Text-to-image |
| 🎵 Audio | Microsoft SpeechT5 | Text-to-speech |
All weights are merged into a single `syko_omni_merged.safetensors` file (~8.5 GB).
Tokenizers and configs are stored separately in their respective folders.
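To illustrate the merge layout, here is a minimal, self-contained sketch of how one state dict with prefixed keys can be split back into per-model state dicts. The prefixes (`qwen`, `sdxl_unet`, `speecht5`, …) match the ones the load script uses; the individual key names are illustrative toys, not the real tensor names.

```python
# Toy merged state dict: every tensor key is namespaced with a model prefix,
# mirroring syko_omni_merged.safetensors (key names here are illustrative).
merged = {
    "qwen.model.embed_tokens.weight": "qwen-tensor",
    "sdxl_unet.conv_in.weight": "unet-tensor",
    "speecht5.encoder.layer_norm.weight": "t5-tensor",
}

def extract(state_dict, prefix):
    """Return only one model's weights, with the prefix stripped."""
    p = prefix + "."
    return {k[len(p):]: v for k, v in state_dict.items() if k.startswith(p)}

print(extract(merged, "qwen"))
# {'model.embed_tokens.weight': 'qwen-tensor'}
```

Because each model lives under its own prefix, loading a single model never has to touch the other models' tensors beyond a key check.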
## Repo Structure

```
SykoSLM/SykoOmni-V1.1-Beta/
├── syko_omni_merged.safetensors   # All model weights
├── config.json                    # Model metadata
├── load_syko_omni.py              # Load & inference script
├── text_tokenizer/                # Qwen tokenizer
├── audio_tokenizer/               # SpeechT5 processor
└── image_tokenizer/               # SDXL configs
```
## Model Test

Example image prompt: *A sweet dog watching the sea outside the window.*
## How It Works

The text model (Qwen) acts as the orchestrator. It decides what to generate based on special tokens:

- `<img_start>...<img_end>` → triggers SDXL image generation
- `<aud_start>...<aud_end>` → triggers SpeechT5 audio generation
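For illustration, the routing step could be sketched as below. The `route` helper is hypothetical (not code from this repo); only the special token names come from the model card.

```python
import re

# Patterns for the orchestrator's special tokens described above.
IMG = re.compile(r"<img_start>(.*?)<img_end>", re.DOTALL)
AUD = re.compile(r"<aud_start>(.*?)<aud_end>", re.DOTALL)

def route(text):
    """Split a Qwen response into (kind, payload) actions for the sub-models."""
    actions = []
    for m in IMG.finditer(text):
        actions.append(("image", m.group(1).strip()))
    for m in AUD.finditer(text):
        actions.append(("audio", m.group(1).strip()))
    if not actions:
        actions.append(("text", text.strip()))
    return actions

print(route("Here you go: <img_start>a dog by the sea<img_end>"))
# [('image', 'a dog by the sea')]
```

Each `("image", prompt)` action would be handed to the SDXL pipeline and each `("audio", text)` action to SpeechT5; plain replies stay text.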
## Usage

```python
# Install dependencies first:
# pip install torch transformers diffusers safetensors soundfile accelerate huggingface_hub

import os

import torch
from huggingface_hub import snapshot_download
from safetensors.torch import load_file
from transformers import (
    AutoTokenizer, Qwen2Config, Qwen2ForCausalLM,
    SpeechT5ForTextToSpeech, SpeechT5Config,
    SpeechT5Processor, SpeechT5HifiGan, SpeechT5HifiGanConfig,
    CLIPTextModel, CLIPTextModelWithProjection, CLIPTokenizer, CLIPTextConfig
)
from diffusers import (
    StableDiffusionXLPipeline, UNet2DConditionModel,
    AutoencoderKL, EulerAncestralDiscreteScheduler
)

REPO = "SykoSLM/SykoOmni-V1.1-Beta"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DTYPE = torch.float16 if DEVICE == "cuda" else torch.float32

print("Downloading/checking the repo...")
repo_dir = snapshot_download(repo_id=REPO)
merged_path = os.path.join(repo_dir, "syko_omni_merged.safetensors")

print("Loading the merged safetensors file into memory...")
merged = load_file(merged_path, device="cpu")

def _extract(merged, prefix):
    """Filter out only one model's weights from the merged state dict."""
    p = prefix + "."
    return {k[len(p):]: v for k, v in merged.items() if k.startswith(p)}

# ==========================================
# 1. LOAD THE TEXT MODEL (QWEN)
# ==========================================
print("🧠 Loading the text model (Qwen)...")
text_tokenizer = AutoTokenizer.from_pretrained(os.path.join(repo_dir, "text_tokenizer"))
text_model = Qwen2ForCausalLM(Qwen2Config.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct"))
text_model.load_state_dict(_extract(merged, "qwen"), strict=False)
text_model = text_model.to(DTYPE).to(DEVICE).eval()

# ==========================================
# 2. LOAD THE IMAGE MODEL (SDXL-TURBO)
# ==========================================
print("🎨 Loading the image model (SDXL)...")
# To keep diffusers from downloading the huge original weights, we build empty
# skeleton models from the config files only.
unet = UNet2DConditionModel.from_config(
    UNet2DConditionModel.load_config("stabilityai/sdxl-turbo", subfolder="unet")
)
vae = AutoencoderKL.from_config(
    AutoencoderKL.load_config("stabilityai/sdxl-turbo", subfolder="vae")
)
scheduler = EulerAncestralDiscreteScheduler.from_config(
    EulerAncestralDiscreteScheduler.load_config("stabilityai/sdxl-turbo", subfolder="scheduler")
)
tokenizer = CLIPTokenizer.from_pretrained("stabilityai/sdxl-turbo", subfolder="tokenizer")
tokenizer_2 = CLIPTokenizer.from_pretrained("stabilityai/sdxl-turbo", subfolder="tokenizer_2")
text_encoder = CLIPTextModel(CLIPTextConfig.from_pretrained("stabilityai/sdxl-turbo", subfolder="text_encoder"))
text_encoder_2 = CLIPTextModelWithProjection(CLIPTextConfig.from_pretrained("stabilityai/sdxl-turbo", subfolder="text_encoder_2"))

# Load the weights from the merged safetensors file into the empty skeletons.
unet.load_state_dict(_extract(merged, "sdxl_unet"), strict=False)
vae.load_state_dict(_extract(merged, "sdxl_vae"), strict=False)
text_encoder.load_state_dict(_extract(merged, "sdxl_text_enc"), strict=False)
text_encoder_2.load_state_dict(_extract(merged, "sdxl_text_enc2"), strict=False)

# Assemble all the parts into the pipeline.
image_pipe = StableDiffusionXLPipeline(
    vae=vae,
    text_encoder=text_encoder,
    text_encoder_2=text_encoder_2,
    tokenizer=tokenizer,
    tokenizer_2=tokenizer_2,
    unet=unet,
    scheduler=scheduler
).to(DEVICE, DTYPE)

# ==========================================
# 3. LOAD THE AUDIO MODEL (SPEECHT5)
# ==========================================
print("🎵 Loading the audio model (SpeechT5)...")
audio_processor = SpeechT5Processor.from_pretrained(os.path.join(repo_dir, "audio_tokenizer"))
audio_model = SpeechT5ForTextToSpeech(SpeechT5Config())
audio_model.load_state_dict(_extract(merged, "speecht5"), strict=False)
audio_model = audio_model.to(DTYPE).to(DEVICE).eval()

vocoder = SpeechT5HifiGan(SpeechT5HifiGanConfig())
vocoder.load_state_dict(_extract(merged, "vocoder"), strict=False)
vocoder = vocoder.to(DEVICE)

print("✅ All models loaded from SykoOmni and ready to use!")
```
## Discord Bot

SykoOmni can be run as a Discord bot with the following commands:
| Command | Description |
|---|---|
| `!Syko <prompt>` | Chat with the text model (can generate images/audio too) |
| `!SykoGörsel <prompt>` | Generate an image directly |
| `!SykoSes <text>` | Generate audio directly |
| `!reset` | Clear conversation history |
## Requirements

```
torch
transformers
diffusers
safetensors
soundfile
accelerate
huggingface_hub
discord.py
```
## Notes

- Minimum 16 GB VRAM recommended for full fp16 inference
- On a T4 (16 GB): enable `enable_model_cpu_offload` and `enable_attention_slicing` for SDXL
- Audio output may have slight noise; SpeechT5 fine-tuning is planned for V1.2
- The text model is small (0.5B), so complex instruction following may be inconsistent
## Roadmap
- Replace SpeechT5 with a fine-tuned version
- Upgrade text model to a larger variant
- Add image understanding (vision encoder)
- CMN (Chunked Memory Network) integration
## License
Apache 2.0