JiRack_GPT3 is not an OpenAI model. It is a GPT-3-class model.

Model Architecture Overview

Architectures Included

I have added my untrained ("empty") models based on the following architectures:

  • GPT-3 Standard
  • Llama 3
  • Mistral

For smaller models modeled after GPT-2, I use LayerNorm and FFN layers. For larger models, these layers are replaced with RMSNorm and SwiGLU, enabling a smoother transition to architectures with larger parameter counts (8B, 33B, 70B, and 120B).
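The LayerNorm → RMSNorm swap can be sketched in PyTorch (a minimal sketch; the class name RMSNorm matches the architecture listings below, but this exact implementation is an assumption):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm: normalize by root-mean-square only (no mean-centering, no bias).
    Cheaper than LayerNorm and standard in Llama/Mistral-style models."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale each vector so its root-mean-square is ~1, then apply a learned gain
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight
```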


Tokenizer Choices

  • For English models: GPT-2 Hugging Face tokenizer
  • For multilingual models: BERT tokenizer from the Hugging Face library
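Loading the two tokenizers with the Hugging Face transformers library might look like this (a sketch; the checkpoint names "gpt2" and "bert-base-multilingual-cased" are my assumptions about which variants are meant):

```python
from transformers import AutoTokenizer

# English models: GPT-2 byte-level BPE tokenizer (assumed checkpoint: "gpt2")
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")

# Multilingual models: BERT WordPiece tokenizer
# (assumed checkpoint: "bert-base-multilingual-cased")
bert_tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

ids = gpt2_tok.encode("Hello world")
print(gpt2_tok.decode(ids))
```

Note that the GPT-2 tokenizer's vocabulary size (50257) matches the VOCAB_SIZE used in the configurations below.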

Training and Tuning

The Transformer block is not frozen, which gives greater flexibility and power when training models from scratch.


Model Architecture Details

GPT-2 Architecture (Classic, Transformer-like)

CustomEmbedding
FrozenSignatureLayer
LearnedPositionalEmbedding
[TransformerBlock]
    ├── MultiHeadAttention
    ├── LayerNorm
    ├── LayerNorm
    ├── FFN
          ├── Linear
          ├── Activation: GELU
          └── Linear
LayerNorm
Linear
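The block above can be sketched as a pre-norm GPT-2-style layer (a minimal sketch under assumed dimensions; CustomEmbedding and FrozenSignatureLayer are custom to this repo and omitted here):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """GPT-2-style pre-norm block: LayerNorm -> MultiHeadAttention,
    LayerNorm -> FFN (Linear -> GELU -> Linear), with residual connections."""
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim),  # expand to 4x, the classic GPT FFN ratio
            nn.GELU(),
            nn.Linear(4 * dim, dim),  # project back down
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.ffn(self.ln2(x))
```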

GPT-3 Architecture (Similar to Llama 3 & Mistral)

CustomEmbedding
# Positional Embedding removed, RoPE integrated in Attention
[TransformerBlock]
    ├── MultiHeadAttention
    ├── SwiGLUFeedForward
          ├── Linear (Gate Layer)
          ├── Linear (Up Layer)
          └── Linear (Projection/Down Layer)
    └── RMSNorm
RMSNorm
Linear
FrozenSignatureLayer
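The comment "Positional Embedding removed, RoPE integrated in Attention" refers to rotary position embeddings applied to queries and keys inside the attention layer. A minimal sketch (pairing channels as two halves is one common convention and an assumption here):

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotary position embedding: rotate channel pairs of x (..., seq, head_dim)
    by position-dependent angles, so relative position enters the dot product."""
    seq_len, head_dim = x.shape[-2], x.shape[-1]
    half = head_dim // 2
    # One frequency per channel pair, geometrically spaced
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # 2-D rotation of each (x1, x2) pair
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

Because it is a pure rotation, the norm of each vector is preserved, and position 0 is left unchanged.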

My LLMs

========================================================

Model Configuration (1B-class model)

========================================================

  • VOCAB_SIZE = 50257
  • MODEL_DIM = 2048
  • NUM_HEADS = 32
  • NUM_LAYERS = 16
  • MAX_SEQ_LEN = 2048
  • RoPE
  • FFN_HIDDEN_DIM = int(MODEL_DIM * 4) # 4 × d_model FFN = 8192
  • HEAD_DIM = MODEL_DIM // NUM_HEADS #64
  • EPSILON = 1e-6
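A quick sanity check that this configuration lands in the 1B class (a rough estimate counting embedding, attention, and FFN weights only; the untied output head is an assumption):

```python
def gpt_param_estimate(vocab_size: int, dim: int, num_layers: int,
                       ffn_hidden: int, tied_head: bool = False) -> int:
    """Rough parameter count: token embedding + per-layer attention/FFN + LM head.
    Norm parameters and biases are negligible and ignored."""
    embed = vocab_size * dim
    attn_per_layer = 4 * dim * dim          # Q, K, V, and output projections
    ffn_per_layer = 2 * dim * ffn_hidden    # up + down projections
    head = 0 if tied_head else vocab_size * dim
    return embed + num_layers * (attn_per_layer + ffn_per_layer) + head

# 1B-class config: VOCAB_SIZE=50257, MODEL_DIM=2048, NUM_LAYERS=16, FFN=4*2048
n = gpt_param_estimate(50257, 2048, 16, 2048 * 4)
print(f"{n / 1e9:.2f}B")  # prints 1.01B
```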

============================================

Model Configuration (31B-class model)

============================================

  • VOCAB_SIZE = 50257
  • MODEL_DIM = 8192 # Large dimension (like Llama 2 70B)
  • NUM_HEADS = 64
  • NUM_LAYERS = 32
  • MAX_SEQ_LEN = 8192 # Large context length
  • RoPE

  • FFN_HIDDEN_DIM = int(MODEL_DIM * 4) # Custom FFN (4D) - 32768
  • HEAD_DIM = MODEL_DIM // NUM_HEADS # 128
  • EPSILON = 1e-6

=============================================

Model Configuration (8B-class model)

=============================================

  • VOCAB_SIZE = 50257
  • MODEL_DIM = 4096 # Increased for 8.5B-class (Standard, High-Efficiency)
  • NUM_HEADS = 32
  • NUM_LAYERS = 40 # Increased to 40 (same as Llama 13B)
  • MAX_SEQ_LEN = 2048
  • RoPE

  • FFN_HIDDEN_DIM = int(MODEL_DIM * 8 / 3) # 10922 (Llama standard)
  • HEAD_DIM = MODEL_DIM // NUM_HEADS # 128
  • EPSILON = 1e-6

==============================================

Model Configuration (10B-class model)

=================================================

  • VOCAB_SIZE = 50257
  • MODEL_DIM = 4096
  • NUM_HEADS = 32
  • NUM_LAYERS = 48 # Increased depth
  • MAX_SEQ_LEN = 2048
  • RoPE
  • FFN_HIDDEN_DIM = int(MODEL_DIM * 8 / 3) #10922 (Llama standard)
  • HEAD_DIM = MODEL_DIM // NUM_HEADS # 128
  • EPSILON = 1e-6

=====================================================================================

Model Configuration (33B-class model) that is available by request

===========================================================================

  • VOCAB_SIZE = 50257
  • MODEL_DIM = 8192 # Large dimension (like Llama 2 70B)
  • NUM_HEADS = 64
  • NUM_LAYERS = 32
  • MAX_SEQ_LEN = 8192 # Large context length
  • RoPE

  • FFN_HIDDEN_DIM = int(MODEL_DIM * 4) # Custom FFN (4D) - 32768
  • HEAD_DIM = MODEL_DIM // NUM_HEADS # 128
  • EPSILON = 1e-6

====================================================================================

70B-Class Model Configuration (LLaMA-70B style) that is available by request

====================================================================================

  • VOCAB_SIZE = 50257
  • MODEL_DIM = 8192 # Hidden size (d_model)
  • NUM_HEADS = 64 # Q Heads
  • NUM_KV_HEADS = 8 # KV Heads (GQA ratio = 8)
  • NUM_LAYERS = 80 # 80 layers
  • MAX_SEQ_LEN = 8192 # Max context (RoPE)
  • FFN_HIDDEN_DIM = 28672 # Standard LLaMA-2-70B FFN size

  • Derivation (LLaMA recipe): 4 * 8192 = 32768 → int(2/3 * 32768) = 21845 → int(1.3 * 21845) = 28398 → rounded up to a multiple of 4096 → 28672
  • HEAD_DIM = MODEL_DIM // NUM_HEADS
  • EPSILON = 1e-6

JiRack Super Brain

It was designed with military-grade goals in mind: discovering worlds, exploring space, and advancing science.

====================================================================================

140B Configuration (real numbers), available by request: JiRack Super Brain

====================================================================================

  • VOCAB_SIZE = 32000

  • MODEL_DIM = 12288 # d_model

  • NUM_HEADS = 96 # Query heads

  • NUM_KV_HEADS = 12 # GQA: 96 / 12 = 8 query heads per KV head

  • NUM_LAYERS = 80

  • HEAD_DIM = MODEL_DIM // NUM_HEADS # 128

  • FFN_HIDDEN_DIM = int(4 * MODEL_DIM * 1.3) # 63897

  • MAX_SEQ_LEN = 131072 # Max context

  • EPSILON = 1e-6
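Grouped-query attention with these numbers means each KV head serves NUM_HEADS / NUM_KV_HEADS = 8 query heads. A minimal sketch of the KV-head expansion:

```python
import torch

NUM_HEADS, NUM_KV_HEADS, HEAD_DIM = 96, 12, 128
group_size = NUM_HEADS // NUM_KV_HEADS  # 8 query heads per KV head

# K/V are projected with only NUM_KV_HEADS heads, then each KV head is
# repeated so that every query head has a matching key tensor
batch, seq = 1, 16
k = torch.randn(batch, NUM_KV_HEADS, seq, HEAD_DIM)
k_for_queries = k.repeat_interleave(group_size, dim=1)
print(k_for_queries.shape)  # torch.Size([1, 96, 16, 128])
```

This shrinks the KV cache by 8x versus full multi-head attention while keeping 96 query heads.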

  • About TorchScript: you can use a TorchScript-exported model for AI classification tasks.

  • Do not use JIT (TorchScript) for chatbot tasks. Load a plain PyTorch state dict for GPT (chatbot) models instead.
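The state-dict recommendation above can be sketched like this (nn.Linear stands in for a real GPT model):

```python
import os
import tempfile
import torch
import torch.nn as nn

model = nn.Linear(8, 2)  # stand-in for a real chatbot/GPT model

# Chatbot (GPT) checkpoints: save and load a plain state dict, no torch.jit
path = os.path.join(tempfile.mkdtemp(), "model.pt")
torch.save(model.state_dict(), path)

restored = nn.Linear(8, 2)  # must rebuild the same architecture first
restored.load_state_dict(torch.load(path))

# Classification-style deployment is where a TorchScript export fits
scripted = torch.jit.script(model)
```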

Note: The large model architectures replace specific layers:

  • LayerNorm → RMSNorm
  • FFN → SwiGLU
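The FFN → SwiGLU replacement, matching the Gate/Up/Down layout in the GPT-3 architecture listing above (a minimal sketch):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """SwiGLU FFN: down( silu(gate(x)) * up(x) ), bias-free as in Llama-style models."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden_dim, bias=False)  # Gate layer
        self.up = nn.Linear(dim, hidden_dim, bias=False)    # Up layer
        self.down = nn.Linear(hidden_dim, dim, bias=False)  # Projection/down layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))
```

The gating (elementwise product of two projections) is why the LLaMA sizing recipe above scales the hidden dimension by 2/3: three matrices replace the classic FFN's two at roughly equal parameter cost.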

JiRack RAG System


Install the tokenizer before running.


You are welcome to ask us to design your corporate model with 33B, 70B, or more parameters.

CMS Manhattan
Copyright Β© 2002–2026
