# JiRack Ternary 70B: Proprietary Ternary-Quantized Transformer

Development status: in training (DEV)
Inventor: Konstantin Vladimirovich Grabko
Contact: grabko@cmsmanhattan.com
Phone: +1 (516) 777-0945
Status: PATENT PENDING (claims filed December 21, 2025)
Seeking: a sponsor for Llama 70B distillation that matches the original Llama 70B's quality, or cooperation with a data center.
## Overview

JiRack Ternary 70B is a ternary-quantized implementation of a 70-billion-parameter Transformer, achieving ~70% VRAM reduction while maintaining near-baseline perplexity. The model uses BitNet-style ternary quantization ($\{-1, 0, +1\}$) with proprietary innovations including:
- Ternary-Quantized Optimization & Bitwise Unpacking
- Buffered Routing Embedding (BRE)
- SwiGLU-Attention (SWA) Fusion
- Hardware-Agnostic Layer-wise Offloading
This model is compatible with the meta-llama/Llama-3.2-70B tokenizer and supports the safetensors format for secure, efficient loading.
## Key Features

### Ternary Quantization (1.58-bit)

Weights are quantized to ternary values $\{-1, 0, +1\}$ using a proprietary bitwise unpacking kernel that extracts 4 parameters from a single byte:
| Parameter | Bitwise Operation | Range |
|---|---|---|
| Param 1 | `(p >> 6) & 0b11` | 0-3 |
| Param 2 | `(p >> 4) & 0b11` | 0-3 |
| Param 3 | `(p >> 2) & 0b11` | 0-3 |
| Param 4 | `p & 0b11` | 0-3 |
Unpacking Equation:

$$w_i = \gamma \cdot (u_i - 1), \qquad u_i \in \{0, 1, 2\}$$

where $u_i$ is the 2-bit value extracted by the operations above and $\gamma$ is a group-wise scaling factor computed per 128-parameter group. The $-1$ offset maps the stored codes $\{0, 1, 2\}$ to the ternary values $\{-1, 0, +1\}$.
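The packing scheme can be exercised end to end. Below is a minimal plain-Python sketch (the production kernel operates on torch tensors); `pack_ternary` and `unpack_ternary` are illustrative names, not from the JiRack codebase, and the group-wise $\gamma$ scaling is omitted for clarity:

```python
def pack_ternary(ws):
    """Pack ternary weights {-1, 0, +1} into bytes, 4 per byte (MSB-first),
    matching the bit layout in the table above."""
    assert len(ws) % 4 == 0
    out = []
    for i in range(0, len(ws), 4):
        a, b, c, d = (w + 1 for w in ws[i:i + 4])   # {-1,0,+1} -> {0,1,2}
        out.append((a << 6) | (b << 4) | (c << 2) | d)
    return bytes(out)

def unpack_ternary(packed):
    """Inverse of pack_ternary: recover ternary weights from packed bytes."""
    ws = []
    for p in packed:
        for shift in (6, 4, 2, 0):
            ws.append(((p >> shift) & 0b11) - 1)    # {0,1,2} -> {-1,0,+1}
    return ws

w = [-1, 0, 1, 1, 0, -1, -1, 0]
packed = pack_ternary(w)          # 8 parameters fit in 2 bytes
assert unpack_ternary(packed) == w
```

The round-trip assertion at the end shows that no information is lost: each byte holds exactly four 2-bit codes.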
### VRAM Efficiency

| Metric | Traditional FP16 | JiRack Ternary 70B |
|---|---|---|
| Memory Footprint | ~140 GB | ~42 GB |
| Memory Reduction | Baseline | ~70% |
| Perplexity Impact | Baseline | ~1.5% degradation |
| Thermal Profile | 80-90 °C | <75 °C |
### Thermal Optimization

The SwiGLU-Attention (SWA) Fusion kernel merges the FFN and multi-head attention operations, reducing activation memory and keeping GPU temperatures below 75 °C during inference.
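For reference, the unfused SwiGLU feed-forward that this kernel builds on computes `down(silu(x @ W_gate) * (x @ W_up))`; the attention-fusion step itself is proprietary and not reproduced here. A minimal plain-Python sketch on a single vector, with illustrative names and no batching:

```python
import math

def silu(x):
    """SiLU / swish activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def matvec(W, v):
    """Row-major matrix-vector product."""
    return [sum(W[i][j] * v[j] for j in range(len(v))) for i in range(len(W))]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """Unfused SwiGLU FFN: down(silu(x @ W_gate) * (x @ W_up))."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    return matvec(w_down, [g * u for g, u in zip(gate, up)])

# Tiny 2-dimensional example with identity weight matrices:
eye = [[1.0, 0.0], [0.0, 1.0]]
out = swiglu_ffn([1.0, 2.0], eye, eye, eye)
```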
### Hardware Compatibility

Tested and validated on:

- NVIDIA RTX 4080 (16 GB VRAM)
- AMD Radeon RX 7900 XT (20 GB VRAM) with ROCm
- Multi-GPU setups (PCIe 4.0)
- Consumer-grade hardware configurations
## Architecture Specifications
| Parameter | Value |
|---|---|
| Total Parameters | 70 Billion |
| Hidden Dimension | 8,192 |
| Intermediate Dimension | 28,672 |
| Number of Layers | 80 |
| Attention Heads | 64 |
| Group Size (N) | 128 |
| Quantization | Ternary (1.58-bit) |
| Weight Format | safetensors |
| Tokenizer | meta-llama/Llama-3.2-70B compatible |
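As a sanity check, the table's dimensions roughly reproduce the 70B figure. The sketch below assumes a Llama-3-style layout with 8 grouped-query KV heads, a 128,256-token vocabulary, and an untied LM head; none of these appear in the table, so treat them as assumptions:

```python
# Assumed (not stated in the table): 8 KV heads (GQA), vocab 128256, untied head.
hidden, inter, layers, heads = 8192, 28672, 80, 64
kv_heads, vocab = 8, 128256
head_dim = hidden // heads                                       # 128

attn = 2 * hidden * hidden + 2 * hidden * (kv_heads * head_dim)  # q, o, k, v
mlp = 3 * hidden * inter                                         # gate, up, down
total = layers * (attn + mlp) + 2 * vocab * hidden               # + embed, head
print(f"~{total / 1e9:.1f}B parameters")  # ~70.6B
```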
## Usage

### Installation

```shell
pip install transformers torch safetensors accelerate
```
### Loading the Model

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load tokenizer (compatible with Llama 3.2 70B)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-70B")

# Load JiRack Ternary 70B model
model = AutoModelForCausalLM.from_pretrained(
    "kgrabko2/jirack-ternary-70b",
    trust_remote_code=True,
    device_map="auto",   # automatic layer-wise offloading
    torch_dtype="auto",
)

# Generate text
inputs = tokenizer("The future of AI is", return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0]))
```
### Advanced: Multi-GPU Inference

```python
from transformers import AutoModelForCausalLM
from accelerate import init_empty_weights, load_checkpoint_and_dispatch

# Build the model skeleton without allocating weight memory
with init_empty_weights():
    model = AutoModelForCausalLM.from_pretrained(
        "kgrabko2/jirack-ternary-70b",
        trust_remote_code=True,
    )

# Load and dispatch weights layer-by-layer across available devices
model = load_checkpoint_and_dispatch(
    model,
    "kgrabko2/jirack-ternary-70b",
    device_map="auto",
    no_split_module_classes=["JiRackDecoderLayer"],
)
```
## Performance Benchmarks

### Memory Efficiency

- FP16 baseline: ~140 GB VRAM
- JiRack Ternary: ~42 GB VRAM
- Reduction: ~70%
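A quick, hedged consistency check of these numbers: raw ternary weights pack 4 parameters per byte, plus one fp16 scale per 128-parameter group. The gap up to ~42 GB presumably covers fp16 embeddings, buffers, and runtime state, which this card does not break out:

```python
params = 70e9
fp16_gb = params * 2 / 1e9          # two bytes per weight -> 140 GB
packed_gb = params / 4 / 1e9        # 4 params per byte    -> 17.5 GB
scales_gb = params / 128 * 2 / 1e9  # fp16 group scales    -> ~1.1 GB
reduction = 1 - 42 / 140            # headline figure      -> 0.70
print(fp16_gb, packed_gb, round(scales_gb, 2), reduction)
```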
### Inference Speed

| Hardware | Tokens/sec (FP16) | Tokens/sec (JiRack) | Speedup |
|---|---|---|---|
| RTX 4080 (16 GB) | OOM | ~12 tok/s | n/a (FP16 OOM) |
| 7900 XT (20 GB) | OOM | ~15 tok/s | n/a (FP16 OOM) |
| 2x RTX 4090 (48 GB) | ~8 tok/s | ~28 tok/s | 3.5x |
### Perplexity (WikiText-2)

- FP16 baseline: 5.23
- JiRack Ternary: 5.31
- Degradation: ~1.5%
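The degradation figure follows directly from the two perplexities:

```python
fp16_ppl, ternary_ppl = 5.23, 5.31
degradation = (ternary_ppl - fp16_ppl) / fp16_ppl * 100
print(f"{degradation:.2f}% relative perplexity increase")  # ~1.53%
```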
## Technical Deep Dive

### Bitwise Unpacking Kernel
```python
def unpack_weights(self):
    if self.packed_weights is None:
        return self.weight

    p = self.packed_weights
    # Extract 4 params from 1 byte using bit shifts
    b1, b2, b3, b4 = (p >> 6) & 0b11, (p >> 4) & 0b11, (p >> 2) & 0b11, p & 0b11
    unpacked = torch.stack([b1, b2, b3, b4], dim=1).view(-1)

    # Apply offset and group-wise scaling; num_el trims byte-alignment padding
    num_el = int(self.orig_shape.prod())
    weights = (unpacked[:num_el].to(torch.float16) - 1.0).view(-1, self.group_size)
    weights = weights * self.weight_scale.view(-1, 1)
    return weights.view(tuple(self.orig_shape.tolist()))
```
### Layer-wise Offloading

The model automatically distributes layers across available GPUs/NPUs, providing:

- Asynchronous memory pooling
- Dynamic device allocation per layer
- Prevention of OOM errors on consumer hardware
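In spirit, the allocation can be pictured as a greedy first-fit assignment of decoder layers to per-device memory budgets. The sketch below is illustrative only: the device names, module names, and the ~0.5 GB-per-layer figure are assumptions, not the model's actual dispatch logic:

```python
def build_device_map(n_layers, layer_gb, budgets):
    """Greedy first-fit: place each layer on the first device with enough
    remaining memory, falling back to CPU.
    budgets: ordered list of (device_name, capacity_gb)."""
    remaining = dict(budgets)
    order = [name for name, _ in budgets]
    device_map = {}
    for layer in range(n_layers):
        target = "cpu"
        for dev in order:
            if remaining[dev] >= layer_gb:
                remaining[dev] -= layer_gb
                target = dev
                break
        device_map[f"model.layers.{layer}"] = target
    return device_map

# 80 layers of ~0.5 GB packed weights across two 15 GB budgets:
dmap = build_device_map(80, 0.5, [("cuda:0", 15.0), ("cuda:1", 15.0)])
```

With these numbers, layers 0-29 land on the first GPU, 30-59 on the second, and the remainder spill to CPU, which is exactly the OOM-avoidance behavior described above.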
## Scaling to 405B Parameters

### JiRack 405B Roadmap

Current need: a sponsor for Llama 405B distillation to match the original model's quality, or a partnership with a data center.
### Projected Specifications

| Parameter | 405B Configuration |
|---|---|
| Memory Footprint | ~243 GB (vs ~810 GB FP16) |
| VRAM Reduction | ~70% |
| LoRA Fine-tuning | ~245 GB (4x RTX 4090) |
| Thermal Profile | <80 °C with SWA Fusion |
### Benefits of JiRack 405B

- Easy fine-tuning: LoRA adapters (r=16) require only ~200 MB
- Consumer hardware: fits on 4x RTX 4090 with offloading
- Thermal stability: SWA Fusion maintains <80 °C during training
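The LoRA size can be ballparked. The sketch below assumes Llama-3.1-405B dimensions (hidden size 16384, 126 layers) and r=16 adapters on only the q and v projections in fp16; the card does not spell out the adapter configuration, so all of these are assumptions. Under them, the adapters come out in the low hundreds of MB, the same order of magnitude as the ~200 MB quoted above:

```python
hidden, layers, r = 16384, 126, 16     # assumed Llama-3.1-405B dimensions
per_proj = 2 * hidden * r              # A (d x r) plus B (r x d)
params = layers * 2 * per_proj         # q_proj and v_proj per layer (assumed)
size_mb = params * 2 / 1e6             # fp16, two bytes per parameter
print(f"{params / 1e6:.0f}M LoRA params, ~{size_mb:.0f} MB")
```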
## Intellectual Property & Licensing

### Patent Pending

Status: formal claims filed December 21, 2025
Core IP Claims:
- Ternary-Quantized Optimization & Bitwise Unpacking
- Buffered Routing Embedding (BRE)
- SwiGLU-Attention (SWA) Fusion
- Hardware-Agnostic Layer-wise Offloading
### License Terms
- Non-Commercial Use: Permitted for research and evaluation
- Commercial Use: Requires CMS Manhattan JiRack License v1.2 execution
- Anti-Patent Clause: Users cannot file patents based on disclosed methods
- Non-Transferable: Access does not transfer IP ownership
Licensing inquiries: grabko@cmsmanhattan.com
## Model Files

This repository contains:

- Ternary-quantized weights (safetensors format)
- Custom modeling code (`trust_remote_code` required)
- Tokenizer configuration (Llama 3.2 compatible)
- LICENSE and NDA.md
## Collaboration Opportunities

Looking for:

- 405B Distillation Sponsor: a partner to distill Llama 405B to JiRack ternary format
- Data Center Partnership: collaboration on large-scale training infrastructure
- Commercial Licensees: SaaS, hardware integration, cloud deployment

### Contact

Konstantin Vladimirovich Grabko
Email: grabko@cmsmanhattan.com
Phone: +1 (516) 777-0945
Plainview, New York, USA
## Citation

If you use this model in your research, please cite:

```bibtex
@software{grabko2025jirack,
  author    = {Grabko, Konstantin Vladimirovich},
  title     = {JiRack Ternary 70B: Proprietary Ternary-Quantized Transformer},
  year      = {2025},
  publisher = {CMS Manhattan},
  url       = {https://huggingface.co/kgrabko2/jirack-ternary-70b},
  note      = {Patent Pending}
}
```
## Disclaimer
This model contains proprietary technology protected by pending patents. All methods, architectures, and techniques disclosed are the intellectual property of Konstantin Vladimirovich Grabko. See LICENSE for full terms.
## Related Resources

- Base model: meta-llama/Meta-Llama-3.1-70B
- Tokenizer: meta-llama/Llama-3.2-70B
- License: LICENSE
- Patent documentation: see repository files

Made by CMS Manhattan, pushing the boundaries of efficient LLM inference.