Update README.md
README.md CHANGED
---
license: apache-2.0
---

# Model Architecture Overview

## Architectures Included

I have added my empty models based on the following architectures:

- **GPT-3 Standard**
- **Llama 3**
- **Mistral**

For smaller models based on **GPT-2**, I use `LayerNorm` and `FFN` layers. For larger models, these layers are replaced with `RMSNorm` and `SwiGLU`, enabling a smoother transition to architectures with larger parameter counts (8B, 33B, 70B, and 120B).
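
For reference, here is a minimal PyTorch sketch of the two replacement components. The class names mirror the architecture listings further down, but this is an illustration only and may differ from the actual modules shipped with these models.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square norm: rescales by the RMS of the features, no mean-centering, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class SwiGLUFeedForward(nn.Module):
    """SwiGLU FFN: down(SiLU(gate(x)) * up(x)), the Llama-style replacement for the GELU FFN."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden_dim, bias=False)   # Gate Layer
        self.up = nn.Linear(dim, hidden_dim, bias=False)     # Up Layer
        self.down = nn.Linear(hidden_dim, dim, bias=False)   # Projection/Down Layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))
```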

---

## Tokenizer Choices

- For English models: **GPT-2 Hugging Face tokenizer**
- For multilingual models: **BERT tokenizer** from the Hugging Face library
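
Both tokenizers can be pulled from the Hugging Face Hub with `transformers`. The checkpoint names below (`gpt2`, `bert-base-multilingual-cased`) are my assumption of typical choices, not necessarily the exact ones used for these models.

```python
from transformers import AutoTokenizer

# English models: the GPT-2 byte-pair-encoding tokenizer
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")

# Multilingual models: a BERT WordPiece tokenizer (multilingual checkpoint assumed)
bert_tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

print(gpt2_tok("Hello world").input_ids)
print(bert_tok("Hello world").input_ids)
```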

---

## Training and Tuning

The **Transformer block is not frozen**, providing greater flexibility and power when tuning models from scratch.
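
Presumably it is the `FrozenSignatureLayer` that stays frozen while every transformer block remains trainable. A minimal sketch, assuming the frozen layer's parameters can be identified by a name prefix such as `frozen_signature` (the prefix is my assumption, not the repo's actual attribute name):

```python
import torch.nn as nn

def freeze_signature_only(model: nn.Module, prefix: str = "frozen_signature") -> nn.Module:
    """Freeze only the signature layer; every TransformerBlock stays trainable."""
    for name, param in model.named_parameters():
        param.requires_grad = not name.startswith(prefix)
    return model

# Usage (illustrative): model = freeze_signature_only(MyModel())
```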

---

## Model Architecture Details

### GPT-2 Architecture (Classic, Transformer-like)

```
CustomEmbedding
FrozenSignatureLayer
LearnedPositionalEmbedding
[TransformerBlock]
├── MultiHeadAttention
├── LayerNorm
├── LayerNorm
└── FFN
    ├── Linear
    ├── Activation: GELU
    └── Linear
LayerNorm
Linear
```
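
For concreteness, here is a minimal sketch of a block with the components listed above (two `LayerNorm`s, multi-head attention, and a GELU `FFN`), built from standard `torch.nn` modules. It is an illustration, not the repo's actual `TransformerBlock`.

```python
import torch.nn as nn

class GPT2StyleBlock(nn.Module):
    """Pre-norm block: LayerNorm -> attention -> residual, LayerNorm -> GELU FFN -> residual."""
    def __init__(self, dim: int, n_heads: int, ffn_mult: int = 4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, ffn_mult * dim),   # Linear
            nn.GELU(),                        # Activation: GELU
            nn.Linear(ffn_mult * dim, dim),   # Linear
        )

    def forward(self, x, attn_mask=None):
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + a                             # attention residual
        return x + self.ffn(self.ln2(x))      # feed-forward residual
```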

---

### GPT-3 Architecture (Similar to Llama 3 & Mistral)

```
CustomEmbedding
# Positional Embedding removed, RoPE integrated in Attention
[TransformerBlock]
├── MultiHeadAttention
├── SwiGLUFeedForward
│   ├── Linear (Gate Layer)
│   ├── Linear (Up Layer)
│   └── Linear (Projection/Down Layer)
└── RMSNorm
RMSNorm
Linear
FrozenSignatureLayer
```
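
The key difference from the GPT-2 listing is that positional information comes from rotary embeddings (RoPE) applied to the query and key projections inside attention, rather than from a `LearnedPositionalEmbedding`. A minimal sketch of that step in the rotate-half formulation (the function name is mine, not the repo's):

```python
import torch
import torch.nn.functional as F

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotary position embedding for q/k tensors of shape (batch, heads, seq, head_dim)."""
    *_, seq_len, head_dim = x.shape
    half = head_dim // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype, device=x.device) / half)
    angles = torch.arange(seq_len, dtype=x.dtype, device=x.device)[:, None] * freqs  # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)

# Inside attention: rotate q and k, then attend as usual (no positional embedding table needed)
# q, k = apply_rope(q), apply_rope(k)
# out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```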

---

**Note:** The large model architectures replace specific layers:

- `LayerNorm` → `RMSNorm`
- `FFN` → `SwiGLU`

---

## License

CMS Manhattan
Copyright © 2002–2026