GAD-2: Generative Autoregressive Decoder (Version 2)

GAD-2 (177M Parameters) is a high-efficiency, hybrid-architecture language model designed for rapid context acquisition and agentic reasoning. It represents a significant evolution in the GAD series, moving from specialized technical modeling (GAD-1) to broad-spectrum linguistic understanding.

🚀 The GAD-2 Leap: Comparison with GAD-1

While GAD-1 focused on niche scientific data (Astronomy) with a 77M parameter footprint, GAD-2 is built to be a general-purpose backbone with vastly improved stability and reasoning capabilities.

| Feature | GAD-1 (Legacy) | GAD-2 (Current) | Status |
|---|---|---|---|
| Model Size | 77M parameters | 177M parameters | +130% growth |
| Context Window | 512 tokens | 1024 tokens | 2x capacity |
| Architecture | GAD-v1 (Agentic Core) | GAD-v2 (Agentic Core) | Refined |
| Primary Data | Wikipedia | FineWeb (general) | Broad scale |
| Training Throughput | Standard | 12,800 batches | High-speed |
| Stability | Full RMSNorm & RoPE | Full RMSNorm & RoPE | Enterprise |

🧠 Architectural Innovation

GAD-2 is not just a larger transformer; it introduces the Agentic Core, a hybrid system that mimics cognitive planning before token generation.

1. Multi-Intent Evolver (MIE)

Unlike static attention layers, GAD-2 uses parallel GRU-based Evolvers. This allows the model to "track" shifting intents across a sequence, preventing the "forgetting" typical of small-scale transformers.
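The card does not publish the Evolver implementation, but the idea of parallel recurrent intent trackers can be sketched as several GRUs scanning the sequence side by side, with their outputs mixed back into the hidden state. All names, sizes, and the mixing layer below are illustrative assumptions, not GAD-2's actual code:

```python
import torch
import torch.nn as nn

class MultiIntentEvolver(nn.Module):
    """Hypothetical sketch: parallel GRU "evolvers" each track one latent
    intent stream across the sequence; a linear layer mixes the tracks."""
    def __init__(self, d_model: int, n_intents: int = 4):
        super().__init__()
        self.evolvers = nn.ModuleList(
            nn.GRU(d_model, d_model, batch_first=True) for _ in range(n_intents)
        )
        self.mix = nn.Linear(d_model * n_intents, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); each evolver scans the sequence independently
        tracks = [gru(x)[0] for gru in self.evolvers]
        return self.mix(torch.cat(tracks, dim=-1))

mie = MultiIntentEvolver(d_model=64, n_intents=4)
out = mie(torch.randn(2, 16, 64))  # shape preserved: (2, 16, 64)
```

Because each GRU carries its own recurrent state, a track can keep following one intent even as the surface topic shifts, which is the "forgetting" mitigation described above.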

2. Adaptive Memory Module

A learnable, persistent memory bank that updates dynamically during training. This module acts as a global anchor, allowing the model to maintain coherence across its entire 1024-token context window.
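One common way to realize such a bank, sketched below under assumptions (slot count, head count, and the residual read are illustrative, not GAD-2's actual design), is a set of learnable memory slots that every position attends to:

```python
import torch
import torch.nn as nn

class AdaptiveMemory(nn.Module):
    """Hypothetical sketch: a learnable bank of persistent memory slots.
    Each token queries the bank via cross-attention, giving the sequence
    a global anchor independent of position."""
    def __init__(self, d_model: int, n_slots: int = 32):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(n_slots, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Queries come from the sequence; keys/values from the persistent slots.
        mem = self.slots.unsqueeze(0).expand(x.size(0), -1, -1)
        read, _ = self.attn(x, mem, mem)
        return x + read  # residual: sequence enriched with global memory

memory = AdaptiveMemory(d_model=64)
y = memory(torch.randn(2, 16, 64))  # shape preserved: (2, 16, 64)
```

Since the slots are `nn.Parameter`s, they are updated by the optimizer during training, matching the "updates dynamically during training" behavior described above.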

3. RoPE & SwiGLU

By implementing Rotary Positional Embeddings (RoPE) and SwiGLU activation, GAD-2 achieves a level of syntactic precision usually reserved for models in the 1B+ parameter range.
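Both components are standard in modern decoders; a minimal reference implementation follows (dimensions are illustrative, and this is the textbook formulation rather than GAD-2's exact code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotary Positional Embedding: rotate feature pairs by an angle that
    grows with position, encoding order without learned position vectors."""
    _, seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq, dtype=torch.float32).unsqueeze(1) * freqs
    cos, sin = angles.cos(), angles.sin()          # each (seq, half)
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: a SiLU-gated linear unit."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

x = torch.randn(2, 16, 64)
rotated = rope(x)          # rotation preserves each token's vector norm
y = SwiGLU(64, 256)(rotated)
```

A useful sanity check on RoPE is that it is a pure rotation: the norm of every token vector is unchanged, only the relative orientation between positions varies.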


📊 Training Metadata

  • Dataset: HuggingFaceFW/fineweb (High-quality web crawl)
  • Total Batches Processed: 12,800
  • Optimization Steps: 800 (Effective via Gradient Accumulation: 16)
  • Training Time: ~2 Hours (Extreme Convergence)
  • Precision: 16-bit Mixed Precision (AMP)
  • Loss Performance: 7.5 → 6.6 (Highly stable descent)
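The schedule above can be sketched as a toy accumulation loop: 16 micro-batches per optimizer step, so 12,800 batches yield 12,800 / 16 = 800 optimization steps. The model, data, and learning rate below are placeholders, and AMP autocast is omitted so the example also runs on CPU:

```python
import torch
import torch.nn as nn

# Toy gradient-accumulation loop (placeholders, not the actual training script).
model = nn.Linear(32, 1)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
accum = 16
opt_steps = 0
for i in range(64):  # stand-in for the full 12,800-batch stream
    x, y = torch.randn(8, 32), torch.randn(8, 1)
    loss = nn.functional.mse_loss(model(x), y) / accum  # scale for accumulation
    loss.backward()
    if (i + 1) % accum == 0:  # step only every `accum` micro-batches
        opt.step()
        opt.zero_grad(set_to_none=True)
        opt_steps += 1

assert 12_800 // accum == 800  # matches the step count reported above
```

Dividing the loss by `accum` keeps the effective gradient equal to that of one large batch of 16x the micro-batch size.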

💻 Usage

To use GAD-2, you must enable trust_remote_code=True as the model utilizes custom Agentic layers.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Raziel1234/GAD-2"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)

print("Downloading and loading model...")
# trust_remote_code=True is required: GAD-2 ships custom Agentic layers.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
).to(device)

prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

print(f"\nPrompt: {prompt}")
print("-" * 30)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=100,  # tokens generated beyond the prompt
        do_sample=True,
        temperature=0.8,
        top_k=50,
    )

generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(f"GAD-2 Output: {generated_text}")
```