GAD-2: Generative Autogressive Decoder (Version 2)

GAD-2 (177M Parameters) is a high-efficiency, hybrid-architecture language model designed for rapid context acquisition and agentic reasoning. It represents a significant evolution in the GAD series, moving from specialized technical modeling (GAD-1) to broad-spectrum linguistic understanding.

🚀 The GAD-2 Leap: Comparison with GAD-1

While GAD-1 focused on niche scientific data (Astronomy) with a 77M parameter footprint, GAD-2 is built to be a general-purpose backbone with vastly improved stability and reasoning capabilities.

Feature	GAD-1 (Legacy)	GAD-2 (Current)	Status
Model Size	77M Parameters	177M Parameters	+130% Growth
Context Window	512 Tokens	1024 Tokens	2x Capacity
Architecture	GAD-v1 (Angetic Core)	GAD-v2 (Agentic Core)	Refined
Primary Data	Wikipedia	FineWeb (General)	Broad Scale
Training Throughput	Standard	12,800 Batches	High-Speed
Stability	Full RMSNorm & RoPE	Full RMSNorm & RoPE	Enterprise

🧠 Architectural Innovation

GAD-2 is not just a larger transformer; it introduces the Agentic Core—a hybrid system that mimics cognitive planning before token generation.

1. Multi-Intent Evolver (MIE)

Unlike static attention layers, GAD-2 uses parallel GRU-based Evolvers. This allows the model to "track" shifting intents across a sequence, preventing the "forgetting" typical of small-scale transformers.

2. Adaptive Memory Module

A learnable, persistent memory bank that updates dynamically during training. This module acts as a global anchor, allowing the model to maintain coherence across its entire 1024 context window.

3. RoPE & SwiGLU

By implementing Rotary Positional Embeddings (RoPE) and SwiGLU activation, GAD-2 achieves a level of syntactic precision usually reserved for models in the 1B+ parameter range.

📊 Training Meta-Data

Dataset: HuggingFaceFW/fineweb (High-quality web crawl)
Total Batches Processed: 12,800
Optimization Steps: 800 (Effective via Gradient Accumulation: 16)
Training Time: ~2 Hours (Extreme Convergence)
Precision: 16-bit Mixed Precision (AMP)
Loss Performance: 7.5 → 6.6 (Highly stable descent)

💻 Usage

To use GAD-2, you must enable trust_remote_code=True as the model utilizes custom Agentic layers.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Raziel1234/GAD-2"

tokenizer = AutoTokenizer.from_pretrained(model_id)

print("Downloading and loading model...")
model = AutoModelForCausalLM.from_pretrained(
    model_id, 
    trust_remote_code=True,
    force_download=True 
).to("cuda")

prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

print(f"\nPrompt: {prompt}")
print("-" * 30)

with torch.no_grad():
    output = model.generate(
        **inputs, 
        max_length=100, 
        do_sample=True, 
        temperature=0.8, 
        top_k=50
    )

generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(f"GAD-2 Output: {generated_text}")

Downloads last month: 12

Safetensors

Model size

0.2B params

Tensor type

F32

Model tree for Raziel1234/GAD-2

Finetunes

1 model

Dataset used to train Raziel1234/GAD-2

Space using Raziel1234/GAD-2 1

Collection including Raziel1234/GAD-2

GAD

Collection

GAD language models family • 4 items • Updated Jan 26 • 1