---
license: gpl-3.0
---

JiRack_GPT3 is not an OpenAI model. It is a GPT-3-class model.

# Model Architecture Overview

## Architectures Included

I have added my empty (untrained) models based on the following architectures:

- **GPT-3 Standard**
- **Llama 3**
- **Mistral**

For smaller models modeled after **GPT-2**, I use `LayerNorm` and `FFN` layers. For larger models, these layers are replaced with `RMSNorm` and `SwiGLU`, enabling a smoother transition to architectures with larger parameter sizes (8B, 33B, 70B, and 140B).
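The `LayerNorm` β†’ `RMSNorm` swap described above can be sketched in PyTorch roughly as follows (the class name and `eps` default here are illustrative, not taken from the JiRack code):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square normalization: no mean subtraction, no bias,
    only a learned per-channel scale (unlike LayerNorm)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the RMS over the last dimension, then scale.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight
```

Because it skips the mean-centering step and the bias, RMSNorm is slightly cheaper than LayerNorm, which is one reason large models favor it.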

---

## Tokenizer Choices

- For English models: **GPT-2 Hugging Face tokenizer**
- For multilingual models: **BERT tokenizer** from the Hugging Face library
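Both tokenizers can be loaded through the Hugging Face `transformers` library; a sketch (the `bert-base-multilingual-cased` checkpoint is my assumption for the multilingual BERT tokenizer):

```python
from transformers import AutoTokenizer

# English models: the GPT-2 tokenizer from Hugging Face.
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")

# Multilingual models: a BERT tokenizer (checkpoint name assumed here).
bert_tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

ids = gpt2_tok.encode("Hello world")
print(ids)  # token IDs under the GPT-2 BPE vocabulary
```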

---

## Training and Tuning

The **Transformer block is not frozen**, providing greater flexibility and power when training or tuning models from scratch.

---

## Model Architecture Details

### GPT-2 Architecture (Classic, Transformer-like)

```
CustomEmbedding
FrozenSignatureLayer
LearnedPositionalEmbedding
[TransformerBlock]
    β”œβ”€β”€ MultiHeadAttention
    β”œβ”€β”€ LayerNorm
    β”œβ”€β”€ LayerNorm
    β”œβ”€β”€ FFN
          β”œβ”€β”€ Linear
          β”œβ”€β”€ Activation: GELU
          └── Linear
LayerNorm
Linear
```

---

### GPT-3 Architecture (Similar to Llama 3 & Mistral)

```
CustomEmbedding
# Positional Embedding removed, RoPE integrated in Attention
[TransformerBlock]
    β”œβ”€β”€ MultiHeadAttention
    β”œβ”€β”€ SwiGLUFeedForward
          β”œβ”€β”€ Linear (Gate Layer)
          β”œβ”€β”€ Linear (Up Layer)
          └── Linear (Projection/Down Layer)
    └── RMSNorm
RMSNorm
Linear
FrozenSignatureLayer
```
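The `SwiGLUFeedForward` block in the diagram above can be sketched in PyTorch roughly as follows (layer names mirror the diagram; bias-free projections are an assumption on my part):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Gate/up/down feed-forward block, as in the GPT-3-style diagram."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden_dim, bias=False)  # Gate Layer
        self.up = nn.Linear(dim, hidden_dim, bias=False)    # Up Layer
        self.down = nn.Linear(hidden_dim, dim, bias=False)  # Projection/Down Layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU(gate(x)) acts as a learned gate on the up projection.
        return self.down(F.silu(self.gate(x)) * self.up(x))
```

Note that SwiGLU uses three weight matrices where a classic FFN uses two, which is why Llama-style models shrink the hidden dimension to roughly 8/3 Γ— D to keep parameter counts comparable.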

## My LLM Configurations

# ========================================================
# Model Configuration (1B-class model)
# ========================================================
- VOCAB_SIZE = 50257
- MODEL_DIM = 2048
- NUM_HEADS = 32
- NUM_LAYERS = 16
- MAX_SEQ_LEN = 2048
- #RoPE
- FFN_HIDDEN_DIM = int(MODEL_DIM * 4) # Non-standard FFN (4D)
- HEAD_DIM = MODEL_DIM // NUM_HEADS #64
- EPSILON = 1e-6
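As a sanity check, the configuration above does land in the 1B class. A rough estimate, assuming a classic two-matrix FFN, untied input/output embeddings, and ignoring biases and norm parameters:

```python
# Rough parameter count for the 1B-class configuration above.
VOCAB_SIZE, MODEL_DIM, NUM_LAYERS = 50257, 2048, 16
FFN_HIDDEN_DIM = MODEL_DIM * 4

attn = 4 * MODEL_DIM * MODEL_DIM         # Q, K, V, O projections
ffn = 2 * MODEL_DIM * FFN_HIDDEN_DIM     # classic two-matrix FFN
per_layer = attn + ffn
embeddings = 2 * VOCAB_SIZE * MODEL_DIM  # input + output head (untied)
total = NUM_LAYERS * per_layer + embeddings
print(f"{total / 1e9:.2f}B parameters")  # β†’ 1.01B parameters
```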
---

# ============================================
# Model Configuration (31B-class model)
# ============================================
- VOCAB_SIZE = 50257
- MODEL_DIM = 8192 # Large dimension (like Llama 2 70B)
- NUM_HEADS = 64
- NUM_LAYERS = 32
- MAX_SEQ_LEN = 8192 # Large context length
- # RoPE
- FFN_HIDDEN_DIM = int(MODEL_DIM * 4) # Custom FFN (4D) - 32768
- HEAD_DIM = MODEL_DIM // NUM_HEADS # 128
- EPSILON = 1e-6
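Several configurations list `# RoPE` in place of a positional-embedding entry. A minimal sketch of rotary position embeddings (the channel-pairing convention used here is one common variant, not necessarily the one used by JiRack):

```python
import torch

def rope_frequencies(head_dim: int, max_seq_len: int, base: float = 10000.0):
    """Precompute RoPE cos/sin tables for every position."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(max_seq_len).float()
    angles = torch.outer(positions, inv_freq)  # (seq, head_dim / 2)
    return torch.cos(angles), torch.sin(angles)

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor):
    """Rotate interleaved channel pairs; x has shape (..., seq, head_dim)."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rotated = torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return rotated.flatten(-2)
```

Because the rotation is applied inside attention to Q and K, no separate positional-embedding table is needed, which is why the GPT-3-style diagram above drops `LearnedPositionalEmbedding`.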

---

# =============================================
# Model Configuration (8B-class model)
# =============================================
- VOCAB_SIZE = 50257
- MODEL_DIM = 4096 # Increased for 8.5B-class (Standard, High-Efficiency)
- NUM_HEADS = 32
- NUM_LAYERS = 40 # Increased to 40 (same as Llama 13B)
- MAX_SEQ_LEN = 2048
- # RoPE
- FFN_HIDDEN_DIM = int(MODEL_DIM * 8 / 3) # 10922 (Llama standard)
- HEAD_DIM = MODEL_DIM // NUM_HEADS # 128
- EPSILON = 1e-6

---

# ==============================================
# Model Configuration (10B-class model)
# =================================================
- VOCAB_SIZE = 50257
- MODEL_DIM = 4096
- NUM_HEADS = 32
- NUM_LAYERS = 48 # Increased depth
- MAX_SEQ_LEN = 2048
- #RoPE
- FFN_HIDDEN_DIM = int(MODEL_DIM * 8 / 3) #10922 (Llama standard)
- HEAD_DIM = MODEL_DIM // NUM_HEADS # 128
- EPSILON = 1e-6

---

# ====================================================================
# Model Configuration (33B-class model), available by request
# ====================================================================
- VOCAB_SIZE = 50257
- MODEL_DIM = 8192 # Large dimension (like Llama 2 70B)
- NUM_HEADS = 64
- NUM_LAYERS = 32
- MAX_SEQ_LEN = 8192 # Large context length
- # RoPE
- FFN_HIDDEN_DIM = int(MODEL_DIM * 4) # Custom FFN (4D) - 32768
- HEAD_DIM = MODEL_DIM // NUM_HEADS # 128
- EPSILON = 1e-6

---

# ====================================================================
# 70B-Class Model Configuration (LLaMA-70B style), available by request
# ====================================================================
- VOCAB_SIZE = 50257
- MODEL_DIM = 8192 # Hidden size (d_model)
- NUM_HEADS = 64 # Q Heads
- NUM_KV_HEADS = 8 # KV Heads (GQA ratio = 8)
- NUM_LAYERS = 80 # 80 layers
- MAX_SEQ_LEN = 8192 # Max context (RoPE)
- # LLaMA-2 70B FFN hidden dim: 28672 = 3.5 Γ— MODEL_DIM
- # (β‰ˆ (8/3) Γ— MODEL_DIM Γ— 1.3 ffn_dim_multiplier, rounded up to a multiple of 4096)
- # Using the standard LLaMA-70B value directly for accuracy
- FFN_HIDDEN_DIM = 28672
- HEAD_DIM = MODEL_DIM // NUM_HEADS
- EPSILON = 1e-6
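The `NUM_KV_HEADS = 8` entry above implies grouped-query attention: 64 query heads share 8 KV heads. A sketch of the usual KV-head expansion step (shapes follow the config; this is illustrative, not the JiRack implementation):

```python
import torch

# GQA layout for the 70B-class config: 64 Q heads, 8 KV heads, group size 8.
NUM_HEADS, NUM_KV_HEADS, HEAD_DIM, SEQ = 64, 8, 128, 16
group = NUM_HEADS // NUM_KV_HEADS  # 8 query heads per KV head

k = torch.randn(1, NUM_KV_HEADS, SEQ, HEAD_DIM)
# Expand each KV head across its query group before computing attention.
k_expanded = k.repeat_interleave(group, dim=1)
print(k_expanded.shape)  # torch.Size([1, 64, 16, 128])
```

The payoff is an 8Γ— smaller KV cache at inference time, which matters at the 8192-token context this config targets.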

---
#
# JiRack Super Brain
# Designed with military-grade goals: discovering new worlds and advancing space and science research
#
# ====================================================================
# 140B-Class Model Configuration (real numbers), available by request
# ====================================================================
- VOCAB_SIZE = 32000
- MODEL_DIM = 12288 # d_model
- NUM_HEADS = 96 # Query heads
- NUM_KV_HEADS = 12 # GQA ratio = 8 (96 / 12)
- NUM_LAYERS = 80
- HEAD_DIM = MODEL_DIM // NUM_HEADS # 128
- FFN_HIDDEN_DIM = 53248 # β‰ˆ 4.33 Γ— MODEL_DIM
- MAX_SEQ_LEN = 131072 # Max context
- EPSILON = 1e-6


- About TorchScript: you can use a TorchScript (JIT) model for AI classification tasks.
- Do not use JIT for chatbot tasks. Use a plain PyTorch state dict for GPT (chatbot) models.
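The save/load split described above can be sketched as follows (the model and file names are hypothetical stand-ins, not the JiRack checkpoints):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a GPT-style chatbot model.
model = nn.Sequential(nn.Embedding(50257, 64), nn.Linear(64, 50257))

# Chatbot (GPT) tasks: save and load a plain state dict, no JIT.
torch.save(model.state_dict(), "jirack_gpt.pt")
model.load_state_dict(torch.load("jirack_gpt.pt"))

# Classification tasks: a scripted (TorchScript) module is acceptable.
scripted = torch.jit.script(nn.Linear(64, 2))
scripted.save("classifier.pt")
```

Loading a state dict requires the model class definition at load time, whereas a TorchScript file carries its own graph; autoregressive chatbot loops with dynamic control flow are typically easier to run from a state dict.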


**Note:** The large model architectures replace specific layers:
- `LayerNorm` β†’ `RMSNorm`
- `FFN` β†’ `SwiGLU`

---
### JiRack RAG System
- A microservice architecture with an API Gateway and Service Discovery
- Built on Spring Boot with a Google embeddings model; includes the chatbot, JiRack model deployment, and a Docker script
- Video: https://www.youtube.com/watch?v=vHClQu76kMc
- RAG System: https://bitbucket.org/cmsmanhattan/rag/src/main/

---

# Install the tokenizer before running
---
```
mkdir -p tokenizer
wget -O tokenizer/tokenizer.json https://huggingface.co/gpt2/resolve/main/tokenizer.json
wget -O tokenizer/vocab.json https://huggingface.co/gpt2/resolve/main/vocab.json
wget -O tokenizer/merges.txt https://huggingface.co/gpt2/resolve/main/merges.txt
wget -O tokenizer/tokenizer_config.json https://huggingface.co/gpt2/resolve/main/tokenizer_config.json
```


You are welcome to request a custom corporate model with 33B, 70B, or more parameters.

CMS Manhattan  
Copyright Β© 2002–2026