
ONYX-AI-4Gemma is the Ultra-Optimized version of the Gemma 4 architecture. This edition is specifically engineered for Low-Latency Inference and high-speed responsiveness, even on modest hardware.

By moving beyond standard quantization and implementing raw streaming and SDPA (Scaled Dot Product Attention), we’ve achieved response times comparable to high-end AI platforms such as Grok.


⚡ Performance & "Light-Speed" Features

  • 🚀 Instant-Response Streaming: Tokens are streamed as they are generated via pure text streaming, eliminating waiting time and buffering.
  • 🧠 SDPA Engine: Uses Scaled Dot Product Attention for ultra-fast context processing.
  • ⚙️ CPU Turbo Mode: Optimized with torch.inference_mode and channels_last memory format to squeeze every bit of power from your CPU.
  • 💾 Smart 4-bit NF4: Reduces memory footprint to ~10GB without sacrificing the model's core intelligence.
  • 🚫 Zero-Buffering: Custom FastAPI headers (X-Accel-Buffering: no) ensure real-time interaction, perfect for Flutter and Web integrations.

⏱ Performance Metrics

| Hardware | Generation Speed | Latency |
|---|---|---|
| Standard CPU | ~5-15 tokens/sec | Low |
| Modern CPU (multi-threaded) | ~20-40 tokens/sec | Ultra-low |
| GPU (T4/RTX) | Blazing fast | Near-zero |

🚀 Technical Innovations (The ONYX Way)

  • 🔥 BFloat16 Computation: Higher precision and faster math on modern processors.
  • 💬 Chat-Template Native: Fully compatible with standard chat templates for seamless multi-turn conversations.
  • 🕸 Anti-Buffering Architecture: Designed to work perfectly with front-end frameworks without data: prefix overhead.
  • 🔎 RAG Ready: Built-in support for external knowledge injection via FAISS.

💻 System Requirements (The Sweet Spot)

To experience ONYX-AI-4Gemma at "Light-Speed," here are the recommended hardware specifications:

🧠 Processor (CPU)

  • Architecture: x86_64 or ARM64.
  • Cores: Minimum 4 Cores, Recommended 8+ Cores.
  • Instruction Sets: Optimized for CPUs supporting AVX2 and BFloat16 (found in modern AMD Ryzen & Intel Core series).
  • Performance Note: ONYX utilizes torch.set_num_threads to leverage every single thread of your processor for maximum generation speed.
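The note above boils down to a couple of torch calls, paired with the `torch.inference_mode` mentioned in the features list (a minimal sketch):

```python
import os
import torch

# Use every available core for intra-op parallelism.
n_threads = os.cpu_count() or 1
torch.set_num_threads(n_threads)

# inference_mode disables autograd bookkeeping for faster forward passes.
with torch.inference_mode():
    x = torch.randn(4, 4)
    y = x @ x  # any forward computation here runs without grad tracking
```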

⚡ Memory (RAM)

  • Minimum: 12 GB (The model footprint is ~10GB).
  • Recommended: 16 GB - 32 GB for a smooth multitasking experience and larger context handling.
  • Type: DDR4 or DDR5 (Higher memory bandwidth directly translates to faster token generation).

🎮 Graphics (Optional GPU)

  • While ONYX is a CPU-First powerhouse, it can run on GPUs with 12GB+ VRAM (like NVIDIA RTX 3060/4060 or T4) for even more extreme performance.

🎮 Live Demo

Experience the speed of ONYX-AI-4Gemma right now on Hugging Face Spaces:

Try it on Spaces

🛠 Installation

pip install -r requirements.txt

💻 Usage (Gradio Chat Interface)

This model is designed to run easily with a Gradio web UI.

▶️ Full Example

import os
import torch
import gradio as gr
from threading import Thread
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer, AutoConfig

# 🚀 CPU acceleration: use every available thread
torch.set_num_threads(os.cpu_count())

model_id = "ONYX-APP-AI/onyx-ai-4gemma"
token = os.environ.get("HF_TOKEN")

print("--- 💎 Initializing ONYX for CPU ---")

config = AutoConfig.from_pretrained(model_id, token=token, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, token=token)

print("--- 📦 Loading Model ---")

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    config=config,
    device_map="cpu",
    token=token,
    trust_remote_code=True,
    torch_dtype=torch.float32,
    low_cpu_mem_usage=True
)

# 🚀 Streaming chat function
def chat_onyx(message, history):
    messages = []

    for turn in history:
        if isinstance(turn, dict):  # Gradio "messages" history format
            messages.append({"role": turn["role"], "content": turn["content"]})
        elif isinstance(turn, (list, tuple)):  # Gradio "tuples" history format
            messages.append({"role": "user", "content": turn[0]})
            messages.append({"role": "assistant", "content": turn[1]})

    messages.append({"role": "user", "content": message})

    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True
    ).to("cpu")

    streamer = TextIteratorStreamer(
        tokenizer,
        skip_prompt=True,
        skip_special_tokens=True
    )

    generation_kwargs = dict(
        **inputs,
        streamer=streamer,
        max_new_tokens=80,   # ⚡ short cap = much faster replies
        do_sample=False      # ⚡ greedy decoding is faster
    )

    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()

    partial = ""
    for token in streamer:
        partial += token
        yield partial

with gr.Blocks() as demo:
    gr.Markdown("# 💎 ONYX AI - FAST CPU MODE")
    gr.ChatInterface(fn=chat_onyx)

demo.launch()

📦 Requirements

transformers>=4.40.0
bitsandbytes
accelerate
torch
gradio
sentencepiece

🧠 Notes

  • Reduce max_new_tokens for faster responses.
  • Use do_sample=False for deterministic and faster output.
  • For better performance, run on GPU (T4 or higher recommended).
  • Large context history may slow down CPU inference.

⚠️ Limitations

  • CPU mode is significantly slower than GPU.
  • Long conversations may reduce performance.
  • Quantization reduces size but not all latency overhead.

💎 Credits

Developed by the RUI Company. Project: ONYX AI System.

Model size: 5B params (Safetensors · tensor types: F32, BF16, U8)