OktoBLAS
Beats PyTorch by up to 21% • Fused Attention 3.8x Faster
🔥 Performance
FP16 GEMM
| Matrix Size | OktoBLAS | PyTorch | Result |
|---|---|---|---|
| 1024×1024 | 33.9 TFLOPS | 30.0 TFLOPS | +13.1% 🔥 |
| 2048×2048 | 40.6 TFLOPS | 33.7 TFLOPS | +20.6% 🔥🔥 |
| 4096×4096 | 42.1 TFLOPS | 40.1 TFLOPS | +5.0% |
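TFLOPS figures for dense matrix multiplication are conventionally derived from an operation count of 2·M·N·K divided by wall-clock time. The sketch below assumes that convention (this README does not state how the numbers were measured) and shows how a Result percentage follows from two throughput values. The helper names are illustrative and not part of OktoBLAS.

```python
def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Throughput of one (m x k) @ (k x n) product, counting 2*m*n*k FLOPs."""
    return 2.0 * m * n * k / seconds / 1e12


def percent_gain(ours_tflops: float, baseline_tflops: float) -> float:
    """Relative throughput gain over a baseline, in percent."""
    return (ours_tflops / baseline_tflops - 1.0) * 100.0


# A 2048x2048x2048 GEMM is about 17.2 GFLOP, so 40.6 TFLOPS means ~0.42 ms per call.
flops = 2 * 2048**3
ms_per_call = flops / 40.6e12 * 1e3
print(f"{ms_per_call:.2f} ms per 2048^3 GEMM at 40.6 TFLOPS")

# Gain recomputed from the rounded 4096x4096 figures in the table.
print(f"{percent_gain(42.1, 40.1):+.1f}%")
```

Because the table values are rounded, recomputing a percentage from them can differ from the published figure in the last digit.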
Fused Attention
| Configuration | OktoBLAS | PyTorch | Speedup |
|---|---|---|---|
| B4 S256 D64 | 1.06 TFLOPS | 0.28 TFLOPS | 3.8x 🔥 |
| B4 S512 D64 | 1.20 TFLOPS | 0.93 TFLOPS | 1.3x |
| B8 S256 D64 | 1.17 TFLOPS | 0.55 TFLOPS | 2.1x |
Benchmarks measured on an NVIDIA RTX 4070 Laptop GPU.
What is OktoBLAS?
OktoBLAS is a proprietary, high-performance BLAS engine developed by OktoSeek. It is the core computational backbone of OktoEngine, our native AI training platform.
Built 100% from scratch with zero dependency on NVIDIA cuBLAS.
🎯 Key Highlights
| 100% Independent | No cuBLAS dependency |
| Beats PyTorch | Up to +21% faster 🔥 |
| Fused Attention | Up to 3.8x faster 🔥 |
| Production Ready | Powers OktoEngine |
🌱 Energy Savings & Environmental Impact
OktoBLAS helps save energy and reduce CO₂ emissions worldwide.
By running AI workloads about 12% faster, OktoBLAS shortens GPU run time and so reduces the energy consumed per workload:
| Scale | GPUs | Annual Energy Saved | CO₂ Reduced | Cost Saved |
|---|---|---|---|---|
| Startup | 1-4 | 400-1,700 kWh | 160-680 kg | $60-$260 |
| SMB | 8-32 | 2,300-12,000 kWh | 0.9-4.8 ton | $350-$1,800 |
| Enterprise | 64-256 | 27,000-107,000 kWh | 11-43 ton | $4,000-$16,000 |
| Hyperscaler | 1024+ | 680,000+ kWh | 272+ ton | $102,000+ |
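The CO₂ and cost columns are consistent with fixed conversion factors of roughly 0.4 kg CO₂ per kWh and $0.15 per kWh. Those factors are inferred from the table rows rather than stated by OktoSeek, and real grid intensity and electricity prices vary by region. A minimal sketch of the conversion under that assumption:

```python
KG_CO2_PER_KWH = 0.4  # inferred from the table (400 kWh -> 160 kg); an assumption
USD_PER_KWH = 0.15    # inferred from the table (400 kWh -> $60); an assumption


def savings_from_kwh(kwh_saved: float) -> tuple[float, float]:
    """Return (kg of CO2 avoided, USD saved) for a given amount of energy saved."""
    return kwh_saved * KG_CO2_PER_KWH, kwh_saved * USD_PER_KWH


for label, kwh in [("Startup (low)", 400), ("SMB (high)", 12_000), ("Hyperscaler", 680_000)]:
    co2_kg, usd = savings_from_kwh(kwh)
    print(f"{label}: {co2_kg / 1000:.2f} t CO2, ${usd:,.0f}")
```

Swapping in local values for the two constants gives a region-specific estimate from the same energy figures.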
Impact for Humanity
Every GPU-hour saved means:
- Less electricity consumed from power plants
- Lower CO₂ emissions into the atmosphere
- Lower costs for AI research and development
- More accessible AI for everyone
This is why OktoSeek created OktoBLAS: not just for performance, but for a sustainable AI future.
🔬 OktoSeek Research Mission
One of OktoSeek's primary research areas is developing new mathematical techniques and optimization methods that reduce AI training time without compromising model quality.
Why This Matters for Humanity
The Problem We're Solving

Today, training a large AI model costs:
- 💰 $100,000 to $10,000,000+ in compute
- ⚡ 1,000,000+ kWh of electricity
- Weeks to months of GPU time
- Tons of CO₂ emissions

This means only big companies can create AI.
OktoSeek's Solution
By making training faster and cheaper, we enable:
| Benefit | Impact |
|---|---|
| 🧑‍🔬 Researchers | More experiments in less time |
| 🏫 Universities | Train models on limited budgets |
| Startups | Compete with big tech companies |
| Developing Nations | Access to AI creation, not just consumption |
| 🌱 Planet Earth | Less energy = less carbon emissions |
The Vision
"We believe AI should be accessible to everyone, not just those who can afford million-dollar GPU clusters. By making training 12%+ faster with the same hardware, we're democratizing AI creation and building a more sustainable future."
-- OktoSeek Research Team
Faster training means:
- More people can create AI
- More innovations in less time
- Lower barriers to entry
- Smaller environmental footprint
🔧 Architecture
OktoBLAS is the computational core of the OktoSeek platform:
OktoScript → OktoEngine → OktoBLAS → GPU (Tensor Cores)
📦 Python Package
OktoBLAS is available as a standalone Python package.
Installation
pip install oktoblas
Quick Start
import oktoblas as ob
import numpy as np
# FP16 Matrix Multiplication (Tensor Cores)
A = np.random.randn(2048, 2048).astype(np.float16)
B = np.random.randn(2048, 2048).astype(np.float16)
C = ob.matmul_fp16(A, B) # 40+ TFLOPS
# Fused attention (up to 3.8x faster than PyTorch in the benchmarks above)
Q = np.random.randn(4, 512, 64).astype(np.float32)
K = np.random.randn(4, 512, 64).astype(np.float32)
V = np.random.randn(4, 512, 64).astype(np.float32)
output = ob.attention(Q, K, V)
# Library info
ob.info()
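When trying a fused kernel, it helps to compare its output with an unfused reference. The sketch below is standard scaled dot-product attention, softmax(Q·Kᵀ/√d)·V, in plain NumPy. Whether `ob.attention` applies the 1/√d scaling or any masking is not stated in this README, so treat the scaling as an assumption; `reference_attention` is an illustrative helper, not part of OktoBLAS.

```python
import math

import numpy as np


def reference_attention(Q, K, V):
    """Unfused scaled dot-product attention for (batch, seq, dim) arrays."""
    scale = 1.0 / math.sqrt(Q.shape[-1])
    scores = (Q @ K.transpose(0, 2, 1)) * scale            # (batch, seq, seq)
    scores = scores - scores.max(axis=-1, keepdims=True)   # stabilise the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)         # each row sums to 1
    return weights @ V                                     # (batch, seq, dim)


rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 512, 64)).astype(np.float32) for _ in range(3))
expected = reference_attention(Q, K, V)
print(expected.shape)  # (4, 512, 64)

# With OktoBLAS installed, a comparison could look like this (tolerances are a guess):
# np.testing.assert_allclose(ob.attention(Q, K, V), expected, rtol=1e-3, atol=1e-3)
```

If the fused result differs by a constant factor in the softmax input, the scaling convention is the first thing to check.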
API Reference
# GEMM Operations
ob.matmul(A, B) # FP32 matrix multiplication
ob.matmul_fp16(A, B) # FP16 with Tensor Cores
# Fused Operations
ob.attention(Q, K, V) # Fused Q×K^T×V attention
# Utilities
ob.info() # Library information
ob.is_cuda_available() # Check GPU availability
ob.get_device_info() # GPU details
ob.benchmark(op, size) # Run benchmarks
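`ob.benchmark(op, size)` is listed here without further detail, so a small independent timing harness can be handy for cross-checking. The sketch below times any matmul callable and converts the median time into TFLOPS using the 2·n³ operation count. `time_matmul` and its parameters are hypothetical names, and `np.matmul` in FP32 stands in for the function under test so the example runs without a GPU.

```python
import statistics
import time

import numpy as np


def time_matmul(fn, n: int, dtype=np.float32, warmup: int = 2, repeats: int = 5) -> float:
    """Median wall-clock seconds for fn(A, B) on random n x n inputs."""
    rng = np.random.default_rng(0)
    A = rng.standard_normal((n, n)).astype(dtype)
    B = rng.standard_normal((n, n)).astype(dtype)
    for _ in range(warmup):  # let caches and lazy initialisation settle
        fn(A, B)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(A, B)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


n = 512
seconds = time_matmul(np.matmul, n)  # swap in ob.matmul_fp16 with dtype=np.float16
tflops = 2 * n**3 / seconds / 1e12
print(f"{n}x{n}: {seconds * 1e3:.3f} ms, {tflops:.4f} TFLOPS")
```

Because the functions above take NumPy arrays, a measurement like this would presumably include host-to-device transfers, so it may come out lower than kernel-only figures.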
Maximum Performance Guide
For best results with OktoBLAS:
- Enable cuDNN benchmark
- Use FP16 and Tensor Cores
- Enable automatic mixed precision (AMP)
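The items above use PyTorch terminology. Assuming they refer to the PyTorch settings of the same names (this README does not show how OktoBLAS hooks into PyTorch), a minimal sketch of enabling them could look like this. `enable_fast_paths` and `forward_amp` are illustrative helpers.

```python
try:
    import torch
except ImportError:  # keeps the sketch importable where PyTorch is not installed
    torch = None


def enable_fast_paths() -> bool:
    """Enable the cuDNN autotuner; return False if PyTorch or CUDA is unavailable."""
    if torch is None or not torch.cuda.is_available():
        return False
    torch.backends.cudnn.benchmark = True  # let cuDNN pick the fastest algorithm per shape
    return True


def forward_amp(model, batch):
    """Run a forward pass under FP16 autocast so eligible ops can use Tensor Cores."""
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        return model(batch)


print("fast paths enabled:", enable_fast_paths())
```

The cuDNN autotuner pays off mainly when input shapes stay fixed between steps, since each new shape triggers another round of algorithm selection.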
🧪 OktoScript Integration
Within OktoEngine, OktoBLAS is configured through OktoScript v1.3+:
# okto_version: "1.3"
PROJECT "my-ai-model"
# Enable OktoBLAS as BLAS backend
BLAS {
  backend: "oktoblas"
  precision: "fp16"
}
# Accelerate operations with OktoBLAS
ACCELERATE {
  gemm: "oktoblas"
  attention: "oktoblas"
  fused_ops: true
}
# Enable Tensor Cores
TENSOR_CORES {
  enabled: true
  precision: "fp16"
}
MODEL {
  base: "gpt2"
  device: "cuda"
}
TRAIN {
  epochs: 3
  batch_size: 16
  mixed_precision: true
}
# Performance optimization
OPTIMIZE {
  cudnn_benchmark: true
  tf32: true
}
Run Training
# Standard training
okto train -f train.okt
# With verbose performance logging
okto train -f train.okt --verbose --show-tflops
Expected Output
[OktoBLAS] Device: NVIDIA RTX 4070
[OktoBLAS] FP16 GEMM: 40.6 TFLOPS (beats PyTorch!)
Step 100 | Loss: 2.45 | Speed: 520 ex/s | TFLOPS: 40.2
Step 200 | Loss: 1.89 | Speed: 518 ex/s | TFLOPS: 39.9
...
Training complete! Average: 515 ex/s
OktoSeek Ecosystem
OktoBLAS is a core component of the OktoSeek AI platform, a complete ecosystem for building, training, and deploying AI models with maximum efficiency.
| Component | Description | Status |
|---|---|---|
| OktoScript | The AI Programming Language: DSL for model training | Popular |
| OktoEngine | Native AI Training Runtime, powered by OktoBLAS | Production |
| OktoBLAS | High-Performance BLAS: beats PyTorch by up to 21% | PyPI |
| OkTensor | GPU Tensor Library | Production |
| OktoStudio | AI Development IDE | Coming Soon |
Examples
- examples/python/: Python usage examples
- docs/ENTERPRISE_SAVINGS.md: Energy & Cost Savings
License
OktoBLAS Binary License (Proprietary)
Free for personal and commercial use. Redistribution and modification of binaries prohibited.
Copyright © 2025 OktoSeek AI. All Rights Reserved.
See LICENSE for full terms.
Links
| Website | oktoseek.com |
| PyPI | pypi.org/project/oktoblas |
| GitHub | github.com/oktoseek |
| @oktoseek |
OktoBLAS: The First Independent BLAS to Beat PyTorch
Made with precision by OktoSeek AI