---
license: gpl-3.0
---

JiRack_GPT3 is not an OpenAI model. It is a GPT-3-class model.

# Model Architecture Overview

## Architectures Included

I have added my empty models based on the following architectures:

- **GPT-3 Standard**
- **Llama 3**
- **Mistral**

For smaller models modeled after **GPT-2**, I use `LayerNorm` and `FFN` layers. For larger models, these layers are replaced with `RMSNorm` and `SwiGLU`, enabling a smoother transition to architectures with larger parameter sizes (8B, 33B, 70B, and 120B).

---

## Tokenizer Choices

- For English models: the **GPT-2 Hugging Face tokenizer**
- For multilingual models: the **BERT tokenizer** from the Hugging Face library

---

## Training and Tuning

The **Transformer block is not frozen**, providing greater flexibility and power when tuning models from scratch.

---

## Model Architecture Details

### GPT-2 Architecture (Classic, Transformer-like)

```
CustomEmbedding
FrozenSignatureLayer
LearnedPositionalEmbedding
[TransformerBlock]
 ├── MultiHeadAttention
 ├── LayerNorm
 ├── LayerNorm
 └── FFN
      ├── Linear
      ├── Activation: GELU
      └── Linear
LayerNorm
Linear
```

---

### GPT-3 Architecture (Similar to Llama 3 & Mistral)

```
CustomEmbedding        # Positional Embedding removed; RoPE integrated in Attention
[TransformerBlock]
 ├── MultiHeadAttention
 ├── SwiGLUFeedForward
 │    ├── Linear (Gate Layer)
 │    ├── Linear (Up Layer)
 │    └── Linear (Projection/Down Layer)
 └── RMSNorm
RMSNorm
Linear
FrozenSignatureLayer
```

---

## My LLMs

### Model Configuration (1B-class model)

- VOCAB_SIZE = 50257
- MODEL_DIM = 2048
- NUM_HEADS = 32
- NUM_LAYERS = 16
- MAX_SEQ_LEN = 2048  # RoPE
- FFN_HIDDEN_DIM = int(MODEL_DIM * 4)  # Non-standard FFN (4D)
- HEAD_DIM = MODEL_DIM // NUM_HEADS  # 64
- EPSILON = 1e-6

---

### Model Configuration (31B-class model)

- VOCAB_SIZE = 50257
- MODEL_DIM = 8192  # Large dimension (like Llama 2 70B)
- NUM_HEADS = 64
- NUM_LAYERS = 32
- MAX_SEQ_LEN = 8192  # Large context length (RoPE)
- FFN_HIDDEN_DIM = int(MODEL_DIM * 4)  # Custom FFN (4D): 32768
- HEAD_DIM = MODEL_DIM // NUM_HEADS  # 128
- EPSILON = 1e-6

---

### Model Configuration (8B-class model)

- VOCAB_SIZE = 50257
- MODEL_DIM = 4096  # Increased for the 8.5B class (standard, high-efficiency)
- NUM_HEADS = 32
- NUM_LAYERS = 40  # Increased to 40 (same as Llama 13B)
- MAX_SEQ_LEN = 2048  # RoPE
- FFN_HIDDEN_DIM = int(MODEL_DIM * 8 / 3)  # 10922 (Llama standard)
- HEAD_DIM = MODEL_DIM // NUM_HEADS  # 128
- EPSILON = 1e-6

---

### Model Configuration (10B-class model)

- VOCAB_SIZE = 50257
- MODEL_DIM = 4096
- NUM_HEADS = 32
- NUM_LAYERS = 48  # Increased depth
- MAX_SEQ_LEN = 2048  # RoPE
- FFN_HIDDEN_DIM = int(MODEL_DIM * 8 / 3)  # 10922 (Llama standard)
- HEAD_DIM = MODEL_DIM // NUM_HEADS  # 128
- EPSILON = 1e-6

---

### Model Configuration (33B-class model, available by request)

- VOCAB_SIZE = 50257
- MODEL_DIM = 8192  # Large dimension (like Llama 2 70B)
- NUM_HEADS = 64
- NUM_LAYERS = 32
- MAX_SEQ_LEN = 8192  # Large context length (RoPE)
- FFN_HIDDEN_DIM = int(MODEL_DIM * 4)  # Custom FFN (4D): 32768
- HEAD_DIM = MODEL_DIM // NUM_HEADS  # 128
- EPSILON = 1e-6

---

### 70B-Class Model Configuration (LLaMA-70B style, available by request)

- VOCAB_SIZE = 50257
- MODEL_DIM = 8192  # Hidden size (d_model)
- NUM_HEADS = 64  # Q heads
- NUM_KV_HEADS = 8  # KV heads (GQA ratio = 8)
- NUM_LAYERS = 80  # 80 layers
- MAX_SEQ_LEN = 8192  # Max context (RoPE)
- FFN_HIDDEN_DIM = 28672  # Standard LLaMA-70B FFN hidden dim (≈ 8/3 × MODEL_DIM × 1.3, rounded up to a multiple of 4096)
- HEAD_DIM = MODEL_DIM // NUM_HEADS  # 128
- EPSILON = 1e-6

---

## JiRack Super Brain

It was designed for military use, with the goals of discovering worlds and learning about space and science.

### 140B Configuration (real numbers, available by request): JiRack Super Brain

- VOCAB_SIZE = 32000
- MODEL_DIM = 12288  # d_model
- NUM_HEADS = 96  # Query heads
- NUM_KV_HEADS = 12  # GQA: groups of 8
- NUM_LAYERS = 80
- HEAD_DIM = MODEL_DIM // NUM_HEADS  # 128
- FFN_HIDDEN_DIM = int(4 * MODEL_DIM * 1.3)  # 63897
- MAX_SEQ_LEN = 131072  # Max context
- EPSILON = 1e-6

---

## PyTorch Deployment Notes

- About TorchScript: you can use a scripted PyTorch model for AI classification tasks.
- Do not use JIT for chatbot tasks.
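As a minimal sketch of this workflow, the model below saves and reloads plain weights with `state_dict`, with no `torch.jit` involved. The `TinyGPT` class and file name are illustrative placeholders, not actual JiRack classes:

```python
import torch
import torch.nn as nn

# Illustrative stand-in for a GPT-style model (NOT the actual JiRack class).
class TinyGPT(nn.Module):
    def __init__(self, vocab_size=50257, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, ids):
        return self.lm_head(self.embed(ids))

model = TinyGPT()

# For chatbot / GPT tasks: persist plain weights, no torch.jit involved.
torch.save(model.state_dict(), "tiny_gpt.pt")

# Reload by constructing the module fresh and loading the state dict.
restored = TinyGPT()
restored.load_state_dict(torch.load("tiny_gpt.pt"))
restored.eval()
```

For a pure classification service, the same module could instead be exported once with `torch.jit.script(model)` and served without the Python class definition.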
Use just the plain PyTorch state dict for GPT (chatbot) tasks.

**Note:** The large model architectures replace specific layers:

- `LayerNorm` → `RMSNorm`
- `FFN` → `SwiGLU`

---

### JiRack RAG System

- A microservice architecture with an API Gateway and Service Discovery
- Built on the Spring Boot framework with a Google embeddings model for the JiRack RAG System with a chatbot; JiRack model deployment uses a Docker script
- Video: https://www.youtube.com/watch?v=vHClQu76kMc
- RAG System: https://bitbucket.org/cmsmanhattan/rag/src/main/

---

# Install the tokenizer before running

- `mkdir -p tokenizer`
- `wget -O tokenizer/tokenizer.json https://huggingface.co/gpt2/resolve/main/tokenizer.json`
- `wget -O tokenizer/vocab.json https://huggingface.co/gpt2/resolve/main/vocab.json`
- `wget -O tokenizer/merges.txt https://huggingface.co/gpt2/resolve/main/merges.txt`
- `wget -O tokenizer/tokenizer_config.json https://huggingface.co/gpt2/resolve/main/tokenizer_config.json`

You are welcome to ask us to design your corporate model with 33B, 70B, or more parameters.

CMS Manhattan Copyright © 2002–2026
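As a rough sanity check on the parameter classes quoted in the configurations above, the dense parameter count implied by a configuration can be estimated from its constants alone. This is a back-of-envelope sketch under assumptions of my own (untied input/output embeddings, SwiGLU FFN with three matrices, no biases), not the exact layer-by-layer count of the JiRack models:

```python
def estimate_params(vocab, dim, layers, ffn_hidden, heads, kv_heads=None):
    """Back-of-envelope dense-transformer parameter estimate."""
    head_dim = dim // heads
    kv_dim = (kv_heads or heads) * head_dim           # GQA shrinks K/V projections
    attn = dim * dim + 2 * dim * kv_dim + dim * dim   # Q, K, V, O matrices
    ffn = 3 * dim * ffn_hidden                        # SwiGLU: gate, up, down
    block = attn + ffn + 2 * dim                      # plus two RMSNorm weight vectors
    embeddings = 2 * vocab * dim                      # input embedding + LM head
    return embeddings + layers * block

# 1B-class config from above
print(estimate_params(50257, 2048, 16, int(2048 * 4), heads=32))  # 1279660032 ≈ 1.28B
```

Plugging in the 70B-class constants (MODEL_DIM = 8192, 80 layers, FFN_HIDDEN_DIM = 28672, 64 Q heads, 8 KV heads) lands near 69B, consistent with the advertised class.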