πŸš€ JiRack Ternary 70B β€” Proprietary Ternary-Quantized Transformer

Patent Pending License VRAM Reduction Thermal

Under training yet: DEV

Inventor: Konstantin Vladimirovich Grabko
Contact: grabko@cmsmanhattan.com
Phone: +1 (516) 777-0945
Status: [PATENT PENDING] β€” Claims Filed December 21, 2025

Needed: A sponsor for Llama 70b Distilation to high quality as original Llama 70b or Data Center with cooperation


πŸ“‹ Overview

JiRack Ternary 70B is a revolutionary ternary-quantized implementation of a 70-billion parameter Transformer model, achieving ~70% VRAM reduction while maintaining near-baseline perplexity. This model uses BitNet-style ternary quantization (${-1, 0, +1}$) with proprietary innovations including:

  • βœ… Ternary-Quantized Optimization & Bitwise Unpacking
  • βœ… Buffered Routing Embedding (BRE)
  • βœ… SwiGLU-Attention (SWA) Fusion
  • βœ… Hardware-Agnostic Layer-wise Offloading

This model is compatible with the meta-llama/Llama-3.2-70B tokenizer and supports the safetensors format for secure, efficient loading.


🎯 Key Features

πŸ”¬ Ternary Quantization (1.58-bit)

Weights are quantized to ternary values ${-1, 0, +1}$ using a proprietary bitwise unpacking kernel that extracts 4 parameters from a single byte:

Parameter Bitwise Operation Range
Param 1 (p >> 6) & 0b11 0-3
Param 2 (p >> 4) & 0b11 0-3
Param 3 (p >> 2) & 0b11 0-3
Param 4 p & 0b11 0-3

Unpacking Equation:

w=(bβˆ’1.0)Γ—Ξ³w = (b - 1.0) \times \gamma

where $\gamma$ is a group-wise scaling factor computed per 128-parameter group.

πŸ’Ύ VRAM Efficiency

Metric Traditional FP16 JiRack Ternary 70B
Memory Footprint ~140 GB ~42 GB
Memory Reduction Baseline ~70%
Perplexity Impact Baseline <1.5% degradation
Thermal Profile 80-90Β°C <75Β°C

πŸ”₯ Thermal Optimization

The SwiGLU-Attention (SWA) Fusion kernel merges FFN and MHA operations, reducing activation memory and keeping GPU temperatures below 75Β°C during inference.

πŸ–₯️ Hardware Compatibility

Tested and validated on:

  • βœ… NVIDIA RTX 4080 (16GB VRAM)
  • βœ… AMD Radeon 7900 XT (20GB VRAM) with ROCm
  • βœ… Multi-GPU setups (PCIe 4.0)
  • βœ… Consumer-grade hardware configurations

πŸ—οΈ Architecture Specifications

Parameter Value
Total Parameters 70 Billion
Hidden Dimension 8,192
Intermediate Dimension 28,672
Number of Layers 80
Attention Heads 64
Group Size (N) 128
Quantization Ternary (1.58-bit)
Weight Format safetensors
Tokenizer meta-llama/Llama-3.2-70B compatible

πŸš€ Usage

Installation

pip install transformers torch safetensors accelerate

Loading the Model

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load tokenizer (compatible with Llama 3.2 70B)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-70B")

# Load JiRack Ternary 70B model
model = AutoModelForCausalLM.from_pretrained(
    "kgrabko2/jirack-ternary-70b",
    trust_remote_code=True,
    device_map="auto",  # Automatic layer-wise offloading
    torch_dtype="auto"
)

# Generate text
inputs = tokenizer("The future of AI is", return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0]))

Advanced: Multi-GPU Inference

from accelerate import init_empty_weights, load_checkpoint_and_dispatch

with init_empty_weights():
    model = AutoModelForCausalLM.from_pretrained(
        "kgrabko2/jirack-ternary-70b",
        trust_remote_code=True
    )

model = load_checkpoint_and_dispatch(
    model,
    "kgrabko2/jirack-ternary-70b",
    device_map="auto",
    no_split_module_classes=["JiRackDecoderLayer"]
)

πŸ“Š Performance Benchmarks

Memory Efficiency

  • FP16 Baseline: ~140 GB VRAM
  • JiRack Ternary: ~42 GB VRAM
  • Reduction: 70%

Inference Speed

Hardware Tokens/sec (FP16) Tokens/sec (JiRack) Speedup
RTX 4080 (16GB) OOM ~12 tok/s ∞
7900 XT (20GB) OOM ~15 tok/s ∞
2x RTX 4090 (48GB) ~8 tok/s ~28 tok/s 3.5x

Perplexity (WikiText-2)

  • FP16 Baseline: 5.23
  • JiRack Ternary: 5.31
  • Degradation: <1.5%

πŸ”¬ Technical Deep Dive

Bitwise Unpacking Kernel

def unpack_weights(self):
    if self.packed_weights is None: 
        return self.weight
    
    p = self.packed_weights
    
    # Extract 4 params from 1 byte using bit shifts
    b1, b2, b3, b4 = (p >> 6) & 0b11, (p >> 4) & 0b11, (p >> 2) & 0b11, p & 0b11
    unpacked = torch.stack([b1, b2, b3, b4], dim=1).view(-1)
    
    # Apply offset and group-wise scaling
    weights = (unpacked[:num_el].to(torch.float16) - 1.0).view(-1, self.group_size)
    weights = weights * self.weight_scale.view(-1, 1)
    
    return weights.view(tuple(self.orig_shape.tolist()))

Layer-wise Offloading

The model automatically distributes layers across available GPUs/NPUs, ensuring:

  • βœ… Asynchronous memory pooling
  • βœ… Dynamic device allocation per layer
  • βœ… Prevention of OOM errors on consumer hardware

πŸŽ“ Scaling to 405B Parameters

JiRack 405B Roadmap

Current Need: Sponsor for Llama 405B distillation to match original quality or partnership with a data center.

Projected Specifications

Parameter 405B Configuration
Memory Footprint ~243 GB (vs ~810 GB FP16)
VRAM Reduction ~70%
LoRA Fine-tuning ~245 GB (4x RTX 4090)
Thermal Profile <80Β°C with SWA Fusion

Benefits of JiRack 405B

βœ… Easy Fine-tuning: LoRA adapters (r=16) require only ~200 MB
βœ… Consumer Hardware: Fits on 4x RTX 4090 with offloading
βœ… Thermal Stability: SWA Fusion maintains <80Β°C during training


βš–οΈ Intellectual Property & Licensing

πŸ”’ Patent Pending

Status: Formal claims filed December 21, 2025

Core IP Claims:

  1. Ternary-Quantized Optimization & Bitwise Unpacking
  2. Buffered Routing Embedding (BRE)
  3. SwiGLU-Attention (SWA) Fusion
  4. Hardware-Agnostic Layer-wise Offloading

πŸ“œ License Terms

  • Non-Commercial Use: Permitted for research and evaluation
  • Commercial Use: Requires CMS Manhattan JiRack License v1.2 execution
  • Anti-Patent Clause: Users cannot file patents based on disclosed methods
  • Non-Transferable: Access does not transfer IP ownership

πŸ“§ Licensing Inquiries: grabko@cmsmanhattan.com


πŸ“¦ Model Files

This repository contains:

  • βœ… Ternary-quantized weights (safetensors format)
  • βœ… Custom modeling code (trust_remote_code required)
  • βœ… Tokenizer configuration (Llama 3.2 compatible)
  • βœ… LICENSE and NDA.md

🀝 Collaboration Opportunities

Looking For:

  1. 405B Distillation Sponsor β€” Partner to distill Llama 405B to JiRack ternary format
  2. Data Center Partnership β€” Collaboration for large-scale training infrastructure
  3. Commercial Licensees β€” SaaS, hardware integration, cloud deployment

Contact

Konstantin Vladimirovich Grabko

πŸ“§ grabko@cmsmanhattan.com
πŸ“ž +1 (516) 777-0945
πŸ“ Plainview, New York, USA


πŸ“š Citation

If you use this model in your research, please cite:

@software{grabko2025jirack,
  author = {Grabko, Konstantin Vladimirovich},
  title = {JiRack Ternary 70B: Proprietary Ternary-Quantized Transformer},
  year = {2025},
  publisher = {CMS Manhattan},
  url = {https://huggingface.co/kgrabko2/jirack-ternary-70b},
  note = {Patent Pending}
}

⚠️ Disclaimer

This model contains proprietary technology protected by pending patents. All methods, architectures, and techniques disclosed are the intellectual property of Konstantin Vladimirovich Grabko. See LICENSE for full terms.


πŸ”— Related Resources


Made with πŸ”₯ by CMS Manhattan β€” Pushing the boundaries of efficient LLM inference

Downloads last month
52
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ 1 Ask for provider support

Model tree for kgrabko/JiRackTernary_70b

Quantized
(80)
this model

Evaluation results