Upload 4 files: `NDA.md`, `claims.md`, `invention_description.md`, `performance_data.md`
NDA.md

# Public Non-Disclosure Agreement (P-NDA)

**Invention:** CMS Manhattan JiRack (Ternary Transformer Optimization)
**Disclosing Party:** Konstantin Vladimirovich Grabko
**Effective Date:** Upon access, download, or viewing of this repository.

---

## 1. Acceptance of Terms

By accessing the source code, architecture diagrams, or technical documentation in this repository, you ("The Recipient") acknowledge that you are entering into a binding confidentiality agreement with Konstantin Vladimirovich Grabko. **If you do not agree to these terms, you must exit this repository and delete any downloaded materials immediately.**

---

## 2. Purpose

The Confidential Information is provided solely for **evaluation, educational research, or non-commercial testing**. Any other use requires an express Commercial License.

---

## 3. Identification of Confidential Information (Trade Secrets)

The following elements are considered **"Trade Secrets"** and are protected under this P-NDA even if the code is publicly hosted:

- The specific mathematical implementation of the **Ternary Scaling Factor ($\gamma$)**.
- The internal logic of the **Buffered Routing Embedding (BRE)**.
- The fused kernel architecture of the **SwiGLU-Attention (SWA)**.
- Hardware-specific **optimization constants** for ROCm/HIP.

---

## 4. Obligations & Restrictions

- **No Reverse Engineering:** You shall not attempt to deconstruct the compiled kernels or proprietary logic to create a competing product.
- **Non-Disclosure:** You shall not share, mirror, or redistribute specific technical optimizations (BRE/SWA) to third parties without including this NDA and the original License.
- **No Patent Claiming:** You are strictly prohibited from using the information found here to file patent applications or any other intellectual property claims in any jurisdiction.

---

## 5. Termination of Confidentiality

The obligations under this Public NDA remain in effect for **three (3) years** from your last access to the materials, or until the Invention is fully disclosed in a granted public patent.

---

## 6. Legal Enforcement

This agreement is governed by the laws of the State of New York. **Unauthorized use or disclosure of these Trade Secrets may result in statutory damages and injunctive relief.**
claims.md

# Intellectual Property Claims & Patent Pending Notice

**Project:** CMS Manhattan JiRack
**Inventor:** Konstantin Vladimirovich Grabko
**Contact:** grabko@cmsmanhattan.com
**Status:** [PATENT PENDING] - Formal Claims Filed/Drafted December 21, 2025.

---

## ⚠️ NOTICE TO DEVELOPERS AND COMMERCIAL ENTITIES

The technologies, architectures, and methods disclosed in this repository are the proprietary intellectual property of Konstantin Vladimirovich Grabko. This document serves as a formal public record of the following claims to establish **Prior Art** and notice of **Patent Pending** status.

---

## I. Field of Invention

This invention pertains to **machine learning optimization**, specifically the **compression and hardware-acceleration of Transformer-based models** for non-NVIDIA (ROCm/HIP) environments.

---

## II. Core Intellectual Property Claims

### 1. Ternary-Quantized Optimization
A method for reducing model VRAM footprint by quantizing weights into a ternary set $\{-1, 0, +1\}$, utilizing:
- A learnable scaling factor $\gamma$.
- A straight-through estimator (STE) to preserve model perplexity.

This method achieves up to 70% memory reduction.

### 2. Buffered Routing Embedding (BRE)
A proprietary dynamic routing architecture that utilizes **shared memory pools** on High Bandwidth Memory (HBM). This claim covers:
- The specific **per-layer buffering logic** that minimizes redundant data movement between the GPU global memory and compute units.

### 3. SwiGLU-Attention (SWA) Fusion
A novel fused compute kernel that integrates the **SwiGLU feed-forward network (FFN)** and **Multi-Head Attention (MHA)** into a single operational pass. This claim specifically covers:
- The reduction of activation memory overhead.
- The resulting thermal optimization (maintaining $<80^\circ\text{C}$).

### 4. Hardware-Agnostic Inference Pipeline
The specific software stack and **asynchronous memory pooling routine** optimized for ROCm/HIP runtimes, enabling **high-throughput LLM performance on non-proprietary hardware**.

---

## III. Legal Restrictions & Usage

- **Non-Transferable:** Access to this code does not constitute a transfer of ownership of the underlying inventions.
- **Anti-Patent Clause:** Any party using this code is strictly prohibited from filing patent applications based on the **BRE**, **SWA**, or **Ternary-Quantized methods** described herein.
- **Commercial Licensing:** Any commercial use (SaaS, hardware integration, etc.) requires a **signed execution of the CMS Manhattan JiRack License V.1.2**.

---

## IV. Contact for IP Inquiries

For patent licensing, joint venture opportunities, or freedom-to-operate inquiries, please contact:

**Konstantin Vladimirovich Grabko**
- **Email:** grabko@cmsmanhattan.com
- **Phone:** +1 (516) 777-0945
- **Location:** New York, USA
invention_description.md

# Technical Description of the Invention

**Invention Title:**
Method for Ternary-Quantized Transformer Optimization with Buffered Routing Embedding (BRE) and SwiGLU-Attention (SWA) Fusion for Low-VRAM Inference on Non-NVIDIA Hardware

**Inventor:** Konstantin Vladimirovich Grabko
**Contact:** grabko@cmsmanhattan.com | +1 (516) 777-0945
**Date of Conception:** December 2025
**Field of Invention:** Neural network architectures and optimization for efficient inference on non-NVIDIA hardware
**Confidentiality Notice:** Proprietary invention – not for public disclosure without a signed NDA.

---

## 1. Background of the Invention

Conventional Large Language Models (LLMs) rely on high-precision floating-point formats (FP16/BF16), which demand significant memory bandwidth and VRAM, typically necessitating expensive NVIDIA H100/A100 hardware. Existing quantization methods (4-bit/8-bit) often introduce perplexity loss or latency due to complex dequantization steps.

This invention addresses these bottlenecks for non-NVIDIA hardware (AMD ROCm) by reimagining the transformer architecture at the weight, routing, and kernel levels.

---

## 2. Summary of the Invention

The invention consists of a three-tier optimization stack:

- **Ternary Quantization:** Mapping weights to $\{-1, 0, +1\}$.
- **Buffered Routing Embedding (BRE):** Optimizing how tokens access memory.
- **SwiGLU-Attention (SWA) Fusion:** Combining compute-heavy layers into a single hardware kernel.

---

## 3. Detailed Method Steps

### Tier 1: Ternary Weight Optimization

The model weights $W$ are constrained to a ternary set using a learnable scaling factor $\gamma$:

$$W_{quant} = \gamma \cdot \text{sign}(\text{clip}(W, -1, 1))$$

- **Process:** During training, a Straight-Through Estimator (STE) is used to pass gradients through the non-differentiable quantization function.
- **Benefit:** Reduces weight storage by $\approx 70\%$, allowing a 3B-parameter model to fit into 20 GB of VRAM.
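The formula and the STE step above can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the JiRack implementation; the names `ternary_quantize` and `ste_backward` are ours:

```python
import numpy as np

def ternary_quantize(W: np.ndarray, gamma: float) -> np.ndarray:
    """W_quant = gamma * sign(clip(W, -1, 1)), as in the formula above.

    Every nonzero weight maps to +/- gamma; exact zeros stay zero."""
    return gamma * np.sign(np.clip(W, -1.0, 1.0))

def ste_backward(grad_out: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Straight-Through Estimator: the backward pass treats the quantizer
    as the identity, masking only the saturated clip region |W| > 1."""
    return grad_out * (np.abs(W) <= 1.0)

W = np.array([[0.7, -0.2, 0.0],
              [-1.5, 0.4, 2.0]])
Q = ternary_quantize(W, gamma=0.9)    # entries drawn from {-0.9, 0.0, +0.9}
g = ste_backward(np.ones_like(W), W)  # gradient blocked where the clip saturated
```

Because each ternary weight needs only two bits plus a shared scale, such a matrix packs far more densely than its FP16 original, consistent with the ~70% reduction claimed above.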

---

### Tier 2: Buffered Routing Embedding (BRE)

Unlike standard embeddings that load full tables into active memory, BRE implements a dynamic routing mechanism:
- **Step A:** Tokens are analyzed for frequency and importance.
- **Step B:** High-frequency embeddings are cached in a dedicated HBM buffer.
- **Step C:** A routing logic directs the attention mechanism to the buffer, minimizing global memory fetches (HBM-to-cache).
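Steps A-C can be mimicked with a toy software cache. This is purely illustrative: the class name, the naive promotion policy, and the hit counters are our assumptions, not the proprietary BRE logic:

```python
from collections import Counter
import numpy as np

class BufferedEmbeddingSketch:
    """Toy model of BRE: track token frequency (Step A), keep hot rows in
    a small buffer (Step B), and route lookups to the buffer before
    falling back to the full table in 'global memory' (Step C)."""

    def __init__(self, table: np.ndarray, buffer_rows: int):
        self.table = table          # stands in for the full embedding table in HBM
        self.buffer_rows = buffer_rows
        self.freq = Counter()       # Step A: frequency statistics
        self.buffer = {}            # Step B: cached hot rows
        self.buffer_hits = 0
        self.global_fetches = 0

    def lookup(self, token_id: int) -> np.ndarray:
        self.freq[token_id] += 1
        if token_id in self.buffer:      # Step C: served from the buffer
            self.buffer_hits += 1
            return self.buffer[token_id]
        self.global_fetches += 1         # simulated global-memory fetch
        row = self.table[token_id]
        if len(self.buffer) < self.buffer_rows:
            self.buffer[token_id] = row  # naive promotion policy
        return row

emb = BufferedEmbeddingSketch(np.arange(12.0).reshape(6, 2), buffer_rows=2)
for tok in [3, 3, 3, 1, 3]:
    emb.lookup(tok)
# repeated token 3 is served from the buffer after its first fetch
```

After this access pattern the sketch records three buffer hits against only two global fetches, which is the effect BRE aims for on skewed token distributions.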

---

### Tier 3: SwiGLU-Attention (SWA) Fusion

In standard transformers, the Multi-Head Attention (MHA) and Feed-Forward Network (FFN) are separate operations. This invention fuses them:
- **Mechanism:** The SwiGLU activation logic is integrated directly into the attention computation cycle.
- **Hardware Target:** Optimized for AMD’s CDNA/RDNA architectures using HIP kernels.
- **Result:** Thermal stability is maintained below $80^\circ\text{C}$ by reducing redundant register write-backs.
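The dataflow of such a fusion (though not its kernel-level memory behavior) can be shown in plain NumPy. Every name below is our own illustration, not the HIP kernel:

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fused_swa_block(x, Wq, Wk, Wv, Wg, Wu, Wd):
    """Single-pass sketch: the self-attention output feeds the SwiGLU FFN
    directly, without materializing a separate block boundary. The real
    SWA fusion does this inside one HIP kernel to cut register
    write-backs; here only the combined dataflow is illustrated."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v
    # SwiGLU applied in the same pass: silu(gate) * up, then down-projection
    return (silu(attn @ Wg) * (attn @ Wu)) @ Wd

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = fused_swa_block(x, *(rng.normal(size=s) for s in
                           [(8, 8), (8, 8), (8, 8), (8, 16), (8, 16), (16, 8)]))
```

The saving claimed above comes from never writing the attention output back to global memory between the two stages; in this Python sketch `attn` still exists as an array, so only the fused call structure carries over.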

---

## 4. Technical Advantages

| **Feature** | **Traditional Transformers** | **CMS Manhattan JiRack** |
|---------------------------|------------------------------|------------------------------|
| **VRAM Usage (3B Model)** | ~45-60 GB (FP16) | ~20 GB (Ternary) |
| **Hardware Requirement** | NVIDIA Proprietary (CUDA) | Hardware Agnostic (ROCm/HIP) |
| **Operating Temp** | 85°C - 95°C | < 80°C |
| **Memory Bottleneck** | High (Global Memory Fetches) | Low (BRE-Buffered) |

---

## 5. Claims (Summary)

1. A method for Ternary Quantization using learnable scaling $\gamma$ for transformer weights.
2. The architecture of Buffered Routing Embedding (BRE) for HBM memory management.
3. The fusion of SwiGLU and Attention layers into a single hardware-optimized kernel.
4. The application of these methods specifically for non-NVIDIA/ROCm inference pipelines.

---

## 6. Conclusion

This invention represents a significant leap in "democratizing" AI, allowing state-of-the-art model performance on cost-effective, non-proprietary hardware without the traditional trade-offs in speed or accuracy.
performance_data.md

# Performance Benchmarks and Test Results

**Invention Title:** Method for Ternary-Quantized Transformer Optimization with Buffered Routing Embedding and SWA Attention

**Inventor:** Konstantin Vladimirovich Grabko
**Test Date:** December 2025
**Hardware Tested:** AMD MI50 (32 GB HBM2), custom cooling
**Test Script:** `JiRackPyTorch_BitNet_class_3b.py`
**Confidentiality Notice:** Internal test data – proprietary and not for publication.

---

## ROCm System Management Interface

### Concise Info

| GPU | Temp (DieEdge) | AvgPwr | SCLK | MCLK | Fan | Perf | PwrCap | VRAM% | GPU% |
|--------|----------------|--------|----------|----------|--------|------|--------|-------|------|
| GPU[0] | 46.0°C | N/A | 1725 MHz | 1000 MHz | 17.65% | auto | 225.0 W | 59% | 100% |

- **GPU[0]:** `get_power_avg` is not supported on the given system.

### End of ROCm SMI Log

---

## Training Log

```text
Step 4270 | Loss: 9.1875 | VRAM: 14.85GB | 11.9 t/s
Step 4275 | Loss: 9.0000 | VRAM: 14.84GB | 11.9 t/s
Step 4280 | Loss: 9.3125 | VRAM: 14.84GB | 11.9 t/s
Step 4285 | Loss: 10.6875 | VRAM: 14.85GB | 11.9 t/s
Step 4290 | Loss: 10.1250 | VRAM: 14.84GB | 11.9 t/s
Step 4295 | Loss: 10.4375 | VRAM: 14.84GB | 11.9 t/s
Step 4300 | Loss: 10.6250 | VRAM: 14.84GB | 11.9 t/s
Step 4305 | Loss: 10.3125 | VRAM: 14.84GB | 11.9 t/s
Step 4310 | Loss: 10.3750 | VRAM: 14.84GB | 11.9 t/s
Step 4315 | Loss: 10.8750 | VRAM: 14.85GB | 11.9 t/s
Step 4320 | Loss: 10.1875 | VRAM: 14.84GB | 11.9 t/s
Step 4325 | Loss: 9.5625 | VRAM: 14.84GB | 11.9 t/s
Step 4330 | Loss: 9.7500 | VRAM: 14.84GB | 11.9 t/s
Step 4335 | Loss: 8.6875 | VRAM: 14.84GB | 11.9 t/s
Step 4340 | Loss: 9.1875 | VRAM: 14.85GB | 11.9 t/s
Step 4345 | Loss: 10.3125 | VRAM: 14.84GB | 11.9 t/s
Step 4350 | Loss: 10.9375 | VRAM: 14.84GB | 11.9 t/s
Step 4355 | Loss: 10.5000 | VRAM: 14.84GB | 11.9 t/s
Step 4360 | Loss: 9.1250 | VRAM: 14.84GB | 11.9 t/s
Step 4365 | Loss: 10.4375 | VRAM: 14.84GB | 11.9 t/s
Step 4370 | Loss: 9.3125 | VRAM: 14.84GB | 11.9 t/s
Step 4375 | Loss: 10.0625 | VRAM: 14.84GB | 11.9 t/s
Step 4380 | Loss: 10.3125 | VRAM: 14.85GB | 11.9 t/s
Step 4385 | Loss: 10.5000 | VRAM: 14.84GB | 11.9 t/s
Step 4390 | Loss: 9.5000 | VRAM: 14.84GB | 11.9 t/s
Step 4395 | Loss: 10.9375 | VRAM: 14.85GB | 11.9 t/s
Step 4400 | Loss: 7.5312 | VRAM: 14.85GB | 11.9 t/s
Step 4405 | Loss: 9.7500 | VRAM: 14.84GB | 11.9 t/s
Step 4410 | Loss: 10.7500 | VRAM: 14.85GB | 11.9 t/s
Step 4415 | Loss: 9.1875 | VRAM: 14.84GB | 11.9 t/s
Step 4420 | Loss: 11.0000 | VRAM: 14.84GB | 11.9 t/s
Step 4425 | Loss: 9.5625 | VRAM: 14.84GB | 11.9 t/s
Step 4430 | Loss: 10.3750 | VRAM: 14.84GB | 11.9 t/s
Step 4435 | Loss: 10.8750 | VRAM: 14.84GB | 11.9 t/s
Step 4440 | Loss: 10.9375 | VRAM: 14.85GB | 11.9 t/s
Step 4445 | Loss: 10.0000 | VRAM: 14.84GB | 11.9 t/s
Step 4450 | Loss: 9.1875 | VRAM: 14.84GB | 11.9 t/s
Step 4455 | Loss: 9.6875 | VRAM: 14.84GB | 11.9 t/s
Step 4460 | Loss: 10.5625 | VRAM: 14.84GB | 11.9 t/s
Step 4465 | Loss: 10.4375 | VRAM: 14.85GB | 11.9 t/s
Step 4470 | Loss: 10.4375 | VRAM: 14.84GB | 11.9 t/s
Step 4475 | Loss: 9.5000 | VRAM: 14.84GB | 11.9 t/s
Step 4480 | Loss: 9.8750 | VRAM: 14.85GB | 11.9 t/s
Step 4485 | Loss: 8.1875 | VRAM: 14.85GB | 11.9 t/s
Step 4490 | Loss: 11.1875 | VRAM: 14.84GB | 11.9 t/s
Step 4495 | Loss: 10.6875 | VRAM: 14.84GB | 11.9 t/s
Step 4500 | Loss: 10.6875 | VRAM: 14.84GB | 11.9 t/s
>>> SAVING: Checkpoint to ./models/ternary_3b_checkpoint_Step_4500...
>>> CLEANUP: Removing old checkpoint ./models/ternary_3b_checkpoint_Step_3000
Step 4505 | Loss: 10.5000 | VRAM: 14.85GB | 11.9 t/s
Step 4510 | Loss: 9.5625 | VRAM: 14.84GB | 11.9 t/s
Step 4515 | Loss: 9.8750 | VRAM: 14.84GB | 11.9 t/s
Step 4520 | Loss: 9.6875 | VRAM: 14.84GB | 11.9 t/s
Step 4525 | Loss: 10.6250 | VRAM: 14.84GB | 11.9 t/s
Step 4530 | Loss: 9.3750 | VRAM: 14.85GB | 11.9 t/s
Step 4535 | Loss: 9.5625 | VRAM: 14.84GB | 11.9 t/s
Step 4540 | Loss: 10.5625 | VRAM: 14.84GB | 11.9 t/s
Step 4545 | Loss: 11.0000 | VRAM: 14.84GB | 11.9 t/s
Step 4550 | Loss: 10.0000 | VRAM: 14.85GB | 11.9 t/s
Step 4555 | Loss: 9.9375 | VRAM: 14.84GB | 11.9 t/s
Step 4560 | Loss: 11.0625 | VRAM: 14.85GB | 11.9 t/s
Step 4565 | Loss: 9.3125 | VRAM: 14.85GB | 11.9 t/s
Step 4570 | Loss: 9.3750 | VRAM: 14.84GB | 11.9 t/s
Step 4575 | Loss: 10.8125 | VRAM: 14.85GB | 11.9 t/s
Step 4580 | Loss: 10.7500 | VRAM: 14.85GB | 11.9 t/s
Step 4585 | Loss: 9.3750 | VRAM: 14.85GB | 11.9 t/s
Step 4590 | Loss: 10.7500 | VRAM: 14.84GB | 11.9 t/s
Step 4595 | Loss: 9.3125 | VRAM: 14.84GB | 11.9 t/s
Step 4600 | Loss: 10.6250 | VRAM: 14.84GB | 11.9 t/s
Step 4605 | Loss: 10.4375 | VRAM: 14.84GB | 11.9 t/s
Step 4610 | Loss: 9.8750 | VRAM: 14.85GB | 11.9 t/s
Step 4615 | Loss: 10.6875 | VRAM: 14.84GB | 11.9 t/s
Step 4620 | Loss: 10.0625 | VRAM: 14.85GB | 11.9 t/s
Step 4625 | Loss: 10.6250 | VRAM: 14.84GB | 11.9 t/s
Step 4630 | Loss: 10.7500 | VRAM: 14.85GB | 11.9 t/s
Step 4635 | Loss: 10.5000 | VRAM: 14.84GB | 11.9 t/s
Step 4640 | Loss: 10.0000 | VRAM: 14.85GB | 11.9 t/s
Step 4645 | Loss: 10.9375 | VRAM: 14.84GB | 11.9 t/s
Step 4650 | Loss: 10.6250 | VRAM: 14.84GB | 11.9 t/s
Step 4655 | Loss: 9.6875 | VRAM: 14.85GB | 11.9 t/s
Step 4660 | Loss: 9.5000 | VRAM: 14.85GB | 11.9 t/s
Step 4665 | Loss: 10.8750 | VRAM: 14.84GB | 11.9 t/s
Step 4670 | Loss: 11.0625 | VRAM: 14.84GB | 11.9 t/s
Step 4675 | Loss: 10.8750 | VRAM: 14.84GB | 11.9 t/s
Step 4680 | Loss: 9.2500 | VRAM: 14.84GB | 11.9 t/s
Step 4685 | Loss: 9.0000 | VRAM: 14.85GB | 11.9 t/s
Step 4690 | Loss: 10.5625 | VRAM: 14.84GB | 11.9 t/s
Step 4695 | Loss: 10.1875 | VRAM: 14.84GB | 11.9 t/s
Step 4700 | Loss: 8.6875 | VRAM: 14.85GB | 11.9 t/s
Step 4705 | Loss: 10.7500 | VRAM: 14.85GB | 11.9 t/s
Step 4710 | Loss: 9.2500 | VRAM: 14.84GB | 11.9 t/s
Step 4715 | Loss: 8.3750 | VRAM: 14.84GB | 11.9 t/s
Step 4720 | Loss: 9.9375 | VRAM: 14.84GB | 11.9 t/s
Step 4725 | Loss: 10.8125 | VRAM: 14.84GB | 11.9 t/s
Step 4730 | Loss: 9.8125 | VRAM: 14.84GB | 11.9 t/s
Step 4735 | Loss: 9.2500 | VRAM: 14.84GB | 11.9 t/s
Step 4740 | Loss: 10.6875 | VRAM: 14.84GB | 11.9 t/s
Step 4745 | Loss: 10.1250 | VRAM: 14.84GB | 11.9 t/s
Step 4750 | Loss: 10.6250 | VRAM: 14.84GB | 11.9 t/s
Step 4755 | Loss: 10.8125 | VRAM: 14.85GB | 11.9 t/s
Step 4760 | Loss: 10.4375 | VRAM: 14.85GB | 11.9 t/s
Step 4765 | Loss: 10.3125 | VRAM: 14.84GB | 11.9 t/s
Step 4770 | Loss: 9.5000 | VRAM: 14.84GB | 11.9 t/s
Step 4775 | Loss: 10.0000 | VRAM: 14.85GB | 11.9 t/s
Step 4780 | Loss: 9.5000 | VRAM: 14.85GB | 11.9 t/s
Step 4785 | Loss: 9.0625 | VRAM: 14.84GB | 11.9 t/s
```