---
license: gpl-3.0
---

JiRack_GPT3 is not an OpenAI model. It is a GPT-3-class model.

# Model Architecture Overview

## Architectures Included

I have added my empty (untrained) models based on the following architectures:

- **GPT-3 Standard**
- **Llama 3**
- **Mistral**

For smaller models modeled after **GPT-2**, I use `LayerNorm` and `FFN` layers. For larger models, these layers are replaced with `RMSNorm` and `SwiGLU`, enabling a smoother transition to architectures with larger parameter sizes (8B, 33B, 70B, and 140B).
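The `LayerNorm` β†’ `RMSNorm` swap described above can be sketched in PyTorch roughly as follows (the class name and `eps` default here are illustrative, not taken from the JiRack code):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square normalization: no mean subtraction, no bias,
    only a learned per-channel scale (unlike LayerNorm)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the RMS over the last dimension, then scale.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight
```

Because it skips the mean-centering step and the bias, RMSNorm is slightly cheaper than LayerNorm, which is one reason large models favor it.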

---

## Tokenizer Choices

- For English models: **GPT-2 Hugging Face tokenizer**
- For multilingual models: **BERT tokenizer** from the Hugging Face library
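Both tokenizers can be loaded through the Hugging Face `transformers` library; a sketch (the `bert-base-multilingual-cased` checkpoint is my assumption for the multilingual BERT tokenizer):

```python
from transformers import AutoTokenizer

# English models: the GPT-2 tokenizer from Hugging Face.
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")

# Multilingual models: a BERT tokenizer (checkpoint name assumed here).
bert_tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

ids = gpt2_tok.encode("Hello world")
print(ids)  # token IDs under the GPT-2 BPE vocabulary
```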

---

## Training and Tuning

The **Transformer block is not frozen**, providing greater flexibility and power when training or tuning models from scratch.

---

## Model Architecture Details

### GPT-2 Architecture (Classic, Transformer-like)

```
CustomEmbedding
FrozenSignatureLayer
LearnedPositionalEmbedding
[TransformerBlock]
    β”œβ”€β”€ MultiHeadAttention
    β”œβ”€β”€ LayerNorm
    β”œβ”€β”€ LayerNorm
    β”œβ”€β”€ FFN
          β”œβ”€β”€ Linear
          β”œβ”€β”€ Activation: GELU
          └── Linear
LayerNorm
Linear
```

---

### GPT-3 Architecture (Similar to Llama 3 & Mistral)

```
CustomEmbedding
# Positional Embedding removed, RoPE integrated in Attention
[TransformerBlock]
    β”œβ”€β”€ MultiHeadAttention
    β”œβ”€β”€ SwiGLUFeedForward
          β”œβ”€β”€ Linear (Gate Layer)
          β”œβ”€β”€ Linear (Up Layer)
          └── Linear (Projection/Down Layer)
    └── RMSNorm
RMSNorm
Linear
FrozenSignatureLayer
```
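The `SwiGLUFeedForward` block in the diagram above can be sketched in PyTorch roughly as follows (layer names mirror the diagram; bias-free projections are an assumption on my part):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Gate/up/down feed-forward block, as in the GPT-3-style diagram."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden_dim, bias=False)  # Gate Layer
        self.up = nn.Linear(dim, hidden_dim, bias=False)    # Up Layer
        self.down = nn.Linear(hidden_dim, dim, bias=False)  # Projection/Down Layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU(gate(x)) acts as a learned gate on the up projection.
        return self.down(F.silu(self.gate(x)) * self.up(x))
```

Note that SwiGLU uses three weight matrices where a classic FFN uses two, which is why Llama-style models shrink the hidden dimension to roughly 8/3 Γ— D to keep parameter counts comparable.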

## My LLM Configurations

# ========================================================
# Model Configuration (1B-class model)
# ========================================================
- VOCAB_SIZE = 50257
- MODEL_DIM = 2048
- NUM_HEADS = 32
- NUM_LAYERS = 16
- MAX_SEQ_LEN = 2048
- #RoPE
- FFN_HIDDEN_DIM = int(MODEL_DIM * 4) # Non-standard FFN (4D)
- HEAD_DIM = MODEL_DIM // NUM_HEADS #64
- EPSILON = 1e-6
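As a sanity check, the configuration above does land in the 1B class. A rough estimate, assuming a classic two-matrix FFN, untied input/output embeddings, and ignoring biases and norm parameters:

```python
# Rough parameter count for the 1B-class configuration above.
VOCAB_SIZE, MODEL_DIM, NUM_LAYERS = 50257, 2048, 16
FFN_HIDDEN_DIM = MODEL_DIM * 4

attn = 4 * MODEL_DIM * MODEL_DIM         # Q, K, V, O projections
ffn = 2 * MODEL_DIM * FFN_HIDDEN_DIM     # classic two-matrix FFN
per_layer = attn + ffn
embeddings = 2 * VOCAB_SIZE * MODEL_DIM  # input + output head (untied)
total = NUM_LAYERS * per_layer + embeddings
print(f"{total / 1e9:.2f}B parameters")  # β†’ 1.01B parameters
```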
---

# ============================================
# Model Configuration (31B-class model)
# ============================================
- VOCAB_SIZE = 50257
- MODEL_DIM = 8192 # Large dimension (like Llama 2 70B)
- NUM_HEADS = 64
- NUM_LAYERS = 32
- MAX_SEQ_LEN = 8192 # Large context length
- # RoPE
- FFN_HIDDEN_DIM = int(MODEL_DIM * 4) # Custom FFN (4D) - 32768
- HEAD_DIM = MODEL_DIM // NUM_HEADS # 128
- EPSILON = 1e-6
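Several configurations list `# RoPE` in place of a positional-embedding entry. A minimal sketch of rotary position embeddings (the channel-pairing convention used here is one common variant, not necessarily the one used by JiRack):

```python
import torch

def rope_frequencies(head_dim: int, max_seq_len: int, base: float = 10000.0):
    """Precompute RoPE cos/sin tables for every position."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(max_seq_len).float()
    angles = torch.outer(positions, inv_freq)  # (seq, head_dim / 2)
    return torch.cos(angles), torch.sin(angles)

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor):
    """Rotate interleaved channel pairs; x has shape (..., seq, head_dim)."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rotated = torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return rotated.flatten(-2)
```

Because the rotation is applied inside attention to Q and K, no separate positional-embedding table is needed, which is why the GPT-3-style diagram above drops `LearnedPositionalEmbedding`.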

---

# =============================================
# Model Configuration (8B-class model)
# =============================================
- VOCAB_SIZE = 50257
- MODEL_DIM = 4096 # Increased for 8.5B-class (Standard, High-Efficiency)
- NUM_HEADS = 32
- NUM_LAYERS = 40 # Increased to 40 (same as Llama 13B)
- MAX_SEQ_LEN = 2048
- # RoPE
- FFN_HIDDEN_DIM = int(MODEL_DIM * 8 / 3) # 10922 (Llama standard)
- HEAD_DIM = MODEL_DIM // NUM_HEADS # 128
- EPSILON = 1e-6

---

# ==============================================
# Model Configuration (10B-class model)
# =================================================
- VOCAB_SIZE = 50257
- MODEL_DIM = 4096
- NUM_HEADS = 32
- NUM_LAYERS = 48 # Increased depth
- MAX_SEQ_LEN = 2048
- #RoPE
- FFN_HIDDEN_DIM = int(MODEL_DIM * 8 / 3) #10922 (Llama standard)
- HEAD_DIM = MODEL_DIM // NUM_HEADS # 128
- EPSILON = 1e-6

---

# ====================================================================
# Model Configuration (33B-class model), available by request
# ====================================================================
- VOCAB_SIZE = 50257
- MODEL_DIM = 8192 # Large dimension (like Llama 2 70B)
- NUM_HEADS = 64
- NUM_LAYERS = 32
- MAX_SEQ_LEN = 8192 # Large context length
- # RoPE
- FFN_HIDDEN_DIM = int(MODEL_DIM * 4) # Custom FFN (4D) - 32768
- HEAD_DIM = MODEL_DIM // NUM_HEADS # 128
- EPSILON = 1e-6

---

# ====================================================================
# 70B-Class Model Configuration (LLaMA-70B style), available by request
# ====================================================================
- VOCAB_SIZE = 50257
- MODEL_DIM = 8192 # Hidden size (d_model)
- NUM_HEADS = 64 # Q Heads
- NUM_KV_HEADS = 8 # KV Heads (GQA ratio = 8)
- NUM_LAYERS = 80 # 80 layers
- MAX_SEQ_LEN = 8192 # Max context (RoPE)
- # LLaMA-2 70B FFN hidden dim: 28672 = 3.5 Γ— MODEL_DIM
- # (β‰ˆ (8/3) Γ— MODEL_DIM Γ— 1.3 ffn_dim_multiplier, rounded up to a multiple of 4096)
- # Using the standard LLaMA-70B value directly for accuracy
- FFN_HIDDEN_DIM = 28672
- HEAD_DIM = MODEL_DIM // NUM_HEADS
- EPSILON = 1e-6
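The `NUM_KV_HEADS = 8` entry above implies grouped-query attention: 64 query heads share 8 KV heads. A sketch of the usual KV-head expansion step (shapes follow the config; this is illustrative, not the JiRack implementation):

```python
import torch

# GQA layout for the 70B-class config: 64 Q heads, 8 KV heads, group size 8.
NUM_HEADS, NUM_KV_HEADS, HEAD_DIM, SEQ = 64, 8, 128, 16
group = NUM_HEADS // NUM_KV_HEADS  # 8 query heads per KV head

k = torch.randn(1, NUM_KV_HEADS, SEQ, HEAD_DIM)
# Expand each KV head across its query group before computing attention.
k_expanded = k.repeat_interleave(group, dim=1)
print(k_expanded.shape)  # torch.Size([1, 64, 16, 128])
```

The payoff is an 8Γ— smaller KV cache at inference time, which matters at the 8192-token context this config targets.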

---
#
# JiRack Super Brain
# Designed with military-grade goals: discovering new worlds and advancing space and science research
#
# ====================================================================
# 140B-Class Model Configuration (real numbers), available by request
# ====================================================================
- VOCAB_SIZE = 32000
- MODEL_DIM = 12288 # d_model
- NUM_HEADS = 96 # Query heads
- NUM_KV_HEADS = 12 # GQA ratio = 8 (96 / 12)
- NUM_LAYERS = 80
- HEAD_DIM = MODEL_DIM // NUM_HEADS # 128
- FFN_HIDDEN_DIM = 53248 # β‰ˆ 4.33 Γ— MODEL_DIM
- MAX_SEQ_LEN = 131072 # Max context
- EPSILON = 1e-6


- About TorchScript: you can use a TorchScript (JIT) model for AI classification tasks.
- Do not use JIT for chatbot tasks. Use a plain PyTorch state dict for GPT (chatbot) models.
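The save/load split described above can be sketched as follows (the model and file names are hypothetical stand-ins, not the JiRack checkpoints):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a GPT-style chatbot model.
model = nn.Sequential(nn.Embedding(50257, 64), nn.Linear(64, 50257))

# Chatbot (GPT) tasks: save and load a plain state dict, no JIT.
torch.save(model.state_dict(), "jirack_gpt.pt")
model.load_state_dict(torch.load("jirack_gpt.pt"))

# Classification tasks: a scripted (TorchScript) module is acceptable.
scripted = torch.jit.script(nn.Linear(64, 2))
scripted.save("classifier.pt")
```

Loading a state dict requires the model class definition at load time, whereas a TorchScript file carries its own graph; autoregressive chatbot loops with dynamic control flow are typically easier to run from a state dict.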


**Note:** The large model architectures replace specific layers:
- `LayerNorm` β†’ `RMSNorm`
- `FFN` β†’ `SwiGLU`

---
### JiRack RAG System
- A microservice architecture with an API Gateway and Service Discovery
- Built on Spring Boot with a Google embeddings model; includes the chatbot, JiRack model deployment, and a Docker script
- Video: https://www.youtube.com/watch?v=vHClQu76kMc
- RAG System: https://bitbucket.org/cmsmanhattan/rag/src/main/

---

# Install the tokenizer before running
---
```
mkdir -p tokenizer
wget -O tokenizer/tokenizer.json https://huggingface.co/gpt2/resolve/main/tokenizer.json
wget -O tokenizer/vocab.json https://huggingface.co/gpt2/resolve/main/vocab.json
wget -O tokenizer/merges.txt https://huggingface.co/gpt2/resolve/main/merges.txt
wget -O tokenizer/tokenizer_config.json https://huggingface.co/gpt2/resolve/main/tokenizer_config.json
```


You are welcome to request a custom corporate model with 33B, 70B, or more parameters.

CMS Manhattan  
Copyright Β© 2002–2026