ONYX-Speed-AI Collection

A curated collection of ultra-fast, low-latency AI models engineered for light-speed performance, optimized for raw streaming and CPU turbo-acceleration.
ONYX-AI-4Gemma is the Ultra-Optimized version of the Gemma 4 architecture. This edition is specifically engineered for Low-Latency Inference and high-speed responsiveness, even on modest hardware.
By moving beyond standard quantization and implementing Raw Streaming and SDPA (Scaled Dot Product Attention), we've achieved response times that rival high-end AI platforms like Grok.
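For context, here is a minimal, hypothetical sketch of how SDPA is typically requested when loading a model with `transformers`; the `attn_implementation` flag is the standard Hugging Face mechanism (recent versions select SDPA automatically when available), and the full loading code actually used by this card appears further below:

```python
import torch
from transformers import AutoModelForCausalLM

# Sketch only: explicitly request PyTorch's fused Scaled Dot Product
# Attention kernel when loading the model.
model = AutoModelForCausalLM.from_pretrained(
    "ONYX-APP-AI/onyx-ai-4gemma",
    attn_implementation="sdpa",  # torch.nn.functional.scaled_dot_product_attention
    torch_dtype=torch.float32,   # matches the CPU setup used later in this card
)
```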
Key optimizations:

- `torch.inference_mode` and the `channels_last` memory format to squeeze every bit of power from your CPU (sketched after the table below).
- Raw streaming (`X-Accel-Buffering: no`) to ensure real-time interaction, perfect for Flutter and Web integrations.

To experience ONYX-AI-4Gemma at "Light-Speed," here are the recommended hardware specifications:

| Hardware | Generation Speed | Latency |
|---|---|---|
| Standard CPU | ~5-15 tokens/sec | Low |
| Modern CPU (Multi-threaded) | ~20-40 tokens/sec | Ultra-Low |
| GPU (T4/RTX) | Blazing Fast | Near-Zero |
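Below is a minimal sketch of the `torch.inference_mode` / `channels_last` combination named above, assuming the same model id and `HF_TOKEN` auth as the full example later in this card; the prompt text is illustrative only:

```python
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ONYX-APP-AI/onyx-ai-4gemma"
token = os.environ.get("HF_TOKEN")  # same auth pattern as the full example below

tokenizer = AutoTokenizer.from_pretrained(model_id, token=token)
model = AutoModelForCausalLM.from_pretrained(model_id, token=token, torch_dtype=torch.float32)

# `channels_last` chiefly benefits 4D (convolutional) tensors; for a text LM
# it is effectively a layout hint, included here because the card lists it.
# nn.Module.to only reorders 4D/5D parameters, so this is safe either way.
model = model.to(memory_format=torch.channels_last)

prompt = tokenizer("Hello, ONYX!", return_tensors="pt")
with torch.inference_mode():  # drop autograd bookkeeping for faster CPU inference
    out = model.generate(**prompt, max_new_tokens=40)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```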
Tokens are streamed as raw text, avoiding the `data:` prefix overhead of standard server-sent events; a minimal serving sketch follows.
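The sketch below shows what such a raw-streaming endpoint can look like. FastAPI, the `/chat` route, and the placeholder generator are illustrative assumptions, not part of this card; the `X-Accel-Buffering: no` header is the one named above, and it tells reverse proxies such as nginx not to buffer the response:

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

def generate_tokens(prompt: str):
    # Placeholder generator: a real app would wrap TextIteratorStreamer
    # as in the Gradio example further below.
    for chunk in ["Hello", " from", " ONYX"]:
        yield chunk  # raw text chunks, no SSE "data:" prefix

@app.get("/chat")
def chat(prompt: str):
    return StreamingResponse(
        generate_tokens(prompt),
        media_type="text/plain",
        # Disable proxy buffering so tokens reach Flutter/Web clients in real time.
        headers={"X-Accel-Buffering": "no"},
    )
```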
`torch.set_num_threads` is used to leverage every single thread of your processor for maximum generation speed, as shown at the top of the code below.

Experience the speed of ONYX-AI-4Gemma right now on Hugging Face Spaces, or run it locally by installing the dependencies:
```bash
pip install -r requirements.txt
```
This model is designed to run easily with a Gradio web UI.
```python
import os
import torch
import gradio as gr
from threading import Thread
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer, AutoConfig

# 🚀 CPU acceleration: use every available core for generation
torch.set_num_threads(os.cpu_count())

model_id = "ONYX-APP-AI/onyx-ai-4gemma"
token = os.environ.get("HF_TOKEN")

print("--- 💎 Initializing ONYX for CPU ---")
config = AutoConfig.from_pretrained(model_id, token=token, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, token=token)

print("--- 📦 Loading Model ---")
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    config=config,
    device_map="cpu",
    token=token,
    trust_remote_code=True,
    torch_dtype=torch.float32,
    low_cpu_mem_usage=True,
)

# 🚀 Fastest chat path: streamed greedy decoding
def chat_onyx(message, history):
    # Rebuild the chat history in the role/content format the template expects.
    messages = []
    for turn in history:
        if isinstance(turn, (list, tuple)):
            messages.append({"role": "user", "content": turn[0]})
            messages.append({"role": "assistant", "content": turn[1]})
        elif isinstance(turn, dict):  # newer Gradio passes history as role/content dicts
            messages.append({"role": turn["role"], "content": turn["content"]})
    messages.append({"role": "user", "content": message})

    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
    ).to("cpu")

    # Stream tokens back as they are generated instead of waiting for the full reply.
    streamer = TextIteratorStreamer(
        tokenizer,
        skip_prompt=True,
        skip_special_tokens=True,
    )
    generation_kwargs = dict(
        **inputs,
        streamer=streamer,
        max_new_tokens=80,  # ⚡ much faster
        do_sample=False,    # ⚡ greedy decoding is faster
    )
    # Run generation in a background thread so we can yield tokens immediately.
    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()

    partial = ""
    for new_text in streamer:  # renamed from `token` to avoid shadowing the HF token above
        partial += new_text
        yield partial

with gr.Blocks() as demo:
    gr.Markdown("# 💎 ONYX AI - FAST CPU MODE")
    gr.ChatInterface(fn=chat_onyx)

demo.launch()
```
The `requirements.txt` for the demo:

```text
transformers>=4.40.0
bitsandbytes
accelerate
torch
gradio
sentencepiece
```
Performance tips:

- Reduce `max_new_tokens` for faster responses.
- Keep `do_sample=False` for deterministic and faster output.

Developed by RUI Company · Project: ONYX AI System