Use from the llama-cpp-python library
# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="Abhinav-Tyagi/synapse_v2",
	filename="synapse-full-f16.gguf",
)
llm.create_chat_completion(
	messages = [
		{"role": "user", "content": "What is a Transformer model?"}
	]
)

---
language:
- en
license: mit
tags:
- transformer
- decoder-only
- from-scratch
- bpe-tokenizer
- instruction-tuning
- educational
- nlp
- pytorch
- small-language-model
- custom-architecture
pipeline_tag: text-generation
---

Synapse v2: Decoder-Only Transformer Built from First Principles

Built by Abhinav Tyagi
📄 GitHub • 💼 LinkedIn • 📜 Research Paper (PDF)


Overview

Synapse v2 is a 3.67M parameter decoder-only Transformer built entirely from scratch: no HuggingFace Trainer, no pre-built architecture. Every component is implemented manually: the attention mechanism, BPE tokenizer, positional embeddings, training loop, and inference pipeline.

This is the second generation of the Synapse series, representing a 4.6× parameter scale-up and 3× depth increase from Synapse v1, with a core focus on transitioning from memorization to generalization.

"This work prioritizes understanding over performance. Building from first principles reveals what production models abstract away." β€” Abhinav Tyagi


Evolution: v1 → v2

| Metric | Synapse v1 | Synapse v2 | Change |
|---|---|---|---|
| Parameters | 800K | 3.67M | 4.6× |
| Layers | 4 | 12 | 3× |
| Vocabulary | 1,500 (basic BPE) | 1,037 (Turbo BPE) | Professional |
| Context Window | 128 tokens | 64 tokens | Optimized |
| Regularization | None | Dropout 0.1 | Added |
| Training Loss | 0.05 (memorization) | 2.04 | Better generalization |
| Validation Loss | Not measured | 3.26 | Tracked |
| Perplexity | 1.05 (overfit) | 7.7 train / 26.1 val | Learned patterns |
| Capability | Text continuation | Instruction following | Functional |

The core lesson: systematic scaling + regularization + quality data = generalization.


Architecture

Raw Text
  ↓
Turbo BPE Tokenization (GPT-2 Regex Pattern)
  ↓
Token Embeddings (1037 → 128)
  +
Positional Embeddings (64 → 128)
  ↓
12× Transformer Blocks
  [LayerNorm → Multi-Head Attention (4 heads) → Residual]
  [LayerNorm → Feed-Forward (128 → 512 → 128) → Residual]
  ↓
Final LayerNorm
  ↓
LM Head (128 → 1037)
  ↓
Next Token
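
The block structure above maps directly onto code. Below is a minimal PyTorch sketch of one pre-norm block, assuming standard torch.nn modules and a ReLU activation; the class and layer names are illustrative, not the actual TinyGPT source.

import torch.nn as nn

class Block(nn.Module):
    """One pre-norm Transformer block: LN → attention → residual, then LN → FFN → residual."""
    def __init__(self, n_embd=128, n_head=4, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.ffn = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),  # 128 -> 512
            nn.ReLU(),                      # activation assumed; the original may differ
            nn.Linear(4 * n_embd, n_embd),  # 512 -> 128
            nn.Dropout(dropout),
        )

    def forward(self, x, attn_mask=None):
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + attn_out               # residual around attention
        x = x + self.ffn(self.ln2(x))  # residual around feed-forward
        return x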

Model Specs

| Component | Value | Rationale |
|---|---|---|
| Model Dimension | 128 | Sweet spot for 3–5M parameter range |
| Attention Heads | 4 (32-dim each) | Optimal for 128-dim model |
| Transformer Layers | 12 | Enables hierarchical feature learning |
| Context Window | 64 tokens | Optimized for dialogue efficiency |
| Vocabulary Size | 1,037 (Turbo BPE) | Professional compression |
| FFN Hidden Size | 512 (4× expansion) | Standard Transformer ratio |
| Dropout | 0.1 | Regularization without over-damping |
| Total Parameters | 3,672,832 | CPU-trainable, interpretable |

Parameter Breakdown

| Component | Parameters | % |
|---|---|---|
| Token Embedding [1037×128] | 132,736 | 3.6% |
| Position Embedding [64×128] | 8,192 | 0.2% |
| 12× Transformer Blocks | ~3,480,000 | 94.7% |
| – Self-Attention per block | ~49,000 | |
| – Feed-Forward per block | ~131,000 | |
| – LayerNorms per block | 512 | |
| Final LayerNorm | 256 | 0.0% |
| LM Head [128×1037] | 132,736 | 3.6% |
| Total | 3,672,832 | 100% |
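
The exact entries in the table can be reproduced with a few lines of arithmetic. The per-block attention figure is my inference (the ~49K count matches three bias-free 128×128 Q/K/V projections; the exact layout isn't spelled out here), so treat that line as an assumption:

d_model, vocab, ctx = 128, 1037, 64
print(vocab * d_model)        # 132,736 -- token embedding (and LM head)
print(ctx * d_model)          # 8,192   -- position embedding
print(2 * d_model)            # 256     -- one LayerNorm (scale + shift)
print(3 * d_model * d_model)  # 49,152  -- assumed Q/K/V projections, the ~49K per block
print(2 * d_model * 512 + 512 + d_model)  # 131,712 -- FFN with biases, the ~131K per block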

Tokenizer: Turbo BPE

Built from scratch: no dependency on HuggingFace or tiktoken.

Key innovation: the GPT-2 regex pre-tokenization pattern (introduced with GPT-2, reused by GPT-3, and the basis for the variants used in later tokenizers such as GPT-4's and Llama 3's):

's|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+

This ensures correct handling of contractions, numbers, punctuation, and whitespace, preventing cross-boundary merges that degrade tokenization quality.
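
You can try the pattern directly with the third-party regex module (the standard library re does not support \p{…} classes); the sample sentence is mine:

import regex

GPT2_PATTERN = r"'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"

print(regex.findall(GPT2_PATTERN, "I don't have 99 problems!"))
# ['I', ' don', "'t", ' have', ' 99', ' problems', '!']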

Optimizations implemented:

  • Doubly-linked list for O(1) merge updates (vs O(n) list rebuilding)
  • Hash map for O(1) pair lookup
  • Position sets to track all occurrences efficiently
  • Rank-based encoding (deterministic, matches GPT-2/GPT-4 behavior)

Compression achieved: 3–4× sequence length reduction.

Example:

"don't" β†’ Basic BPE: ['d','o','n',"'",'t']  # 5 tokens
"don't" β†’ Turbo BPE: ['don', "'t"]           # 2 tokens βœ…

Training

Dataset

Hybrid instruction-tuned corpus (~18–20K pairs, ~200–500K tokens):

| Component | Size | Purpose |
|---|---|---|
| Dolly-15k | ~15K instructions | Reasoning, QA, summarization |
| Creator Profile | ~500 examples | Identity grounding |
| GenAI Knowledge | ~2K examples | Technical expertise |
| Domain Facts | ~1K examples | Grounded world knowledge |
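
A hedged sketch of how such records might be flattened into training text, assuming the Instruction:/Response: template that appears in the Usage section below (the field names follow Dolly's schema; the exact template used in training is not documented here):

def format_example(record):
    # record: {"instruction": ..., "context": ..., "response": ...} -- Dolly-style fields
    context = f"\nContext: {record['context']}" if record.get("context") else ""
    return f"Instruction: {record['instruction']}{context}\nResponse: {record['response']}"

print(format_example({
    "instruction": "What is a Transformer model?",
    "response": "A neural network architecture built on self-attention.",
}))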

Training Configuration

| Hyperparameter | Value |
|---|---|
| Optimizer | Adam |
| Learning Rate | 1e-3 |
| Batch Size | 4 |
| Training Steps | 5,000 |
| Dropout | 0.1 |
| Context Length | 64 tokens |
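
Put together, the configuration corresponds to a loop along these lines; a sketch that assumes random batch sampling and the (logits, loss) return signature implied by the Usage section (get_batch and train_data are hypothetical):

import torch
from model import TinyGPT

model = TinyGPT(vocab_size=1037, n_embd=128, n_head=4, n_layer=12,
                block_size=64, dropout=0.1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def get_batch(data, block_size=64, batch_size=4):
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])  # shifted next-token targets
    return x, y

for step in range(5000):
    xb, yb = get_batch(train_data)  # train_data: 1-D tensor of token ids (hypothetical)
    logits, loss = model(xb, yb)    # assumes the model computes cross-entropy when targets are given
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()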

Training Results

| Metric | Value |
|---|---|
| Initial Loss | 8.54 |
| Final Training Loss | 2.04 |
| Final Validation Loss | 3.26 |
| Train Perplexity | 7.7 |
| Validation Perplexity | 26.1 |

Starting from random initialization at a loss of 8.54 (for comparison, uniform guessing over the 1,037-token vocabulary scores ln(1037) ≈ 6.94), the model converged to 2.04, demonstrating genuine learning rather than memorization. The moderate train/val gap (2.04 vs 3.26) indicates the dropout regularization kept overfitting in check.
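
The reported perplexities follow directly from the losses, since perplexity = exp(cross-entropy):

import math
print(math.exp(2.04))  # 7.69  -> reported train perplexity 7.7
print(math.exp(3.26))  # 26.05 -> reported validation perplexity 26.1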


Key Design Decisions Explained

Why 12 layers? Enables hierarchical learning: syntax (lower layers) → semantics (middle) → reasoning (upper). Empirically validated as the sweet spot for ~4M parameter models.

Why pre-norm (LayerNorm before attention)? More stable training: gradients don't explode or vanish as easily. Standard in modern architectures (GPT-2, GPT-3, and most recent LLMs). Allows higher learning rates.

Why dropout 0.1? The v1 model had a perplexity of 1.05: pure memorization, useless for generalization. Dropout forced the model to learn robust patterns; the validation gap (3.26 vs 2.04) shows it worked.

Why a 64-token context (smaller than v1's 128)? Most conversational turns fit in 50–80 tokens, and halving T cuts the O(T²) attention cost by 4×: faster training, same capability for dialogue tasks.
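
The 4× saving is just the quadratic ratio of the two context lengths:

print(128**2 / 64**2)  # 4.0 -- attention score entries scale as T^2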


Usage

import torch
from model import TinyGPT
from tokenizer import TurboBPE

# Load model
model = TinyGPT(
    vocab_size=1037,
    n_embd=128,
    n_head=4,
    n_layer=12,
    block_size=64,
    dropout=0.0  # Disable dropout at inference
)
model.load_state_dict(torch.load("synapse_v2.pt", map_location="cpu"))
model.eval()

# Load tokenizer
tokenizer = TurboBPE.load("tokenizer_state.json")

# Generate
prompt = "Instruction: What is a Transformer model?\nResponse:"
tokens = tokenizer.encode(prompt)
input_ids = torch.tensor([tokens])

with torch.no_grad():
    for _ in range(100):
        context = input_ids[:, -64:]  # crop to the 64-token context window
        logits, _ = model(context)
        next_token = torch.argmax(logits[:, -1, :], dim=-1)  # greedy decoding
        input_ids = torch.cat([input_ids, next_token.unsqueeze(0)], dim=1)

print(tokenizer.decode(input_ids[0].tolist()))
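
Greedy argmax decoding tends to loop on small models. A temperature-sampling variant of the same loop (the 0.8 temperature is my choice, not from the original):

with torch.no_grad():
    for _ in range(100):
        logits, _ = model(input_ids[:, -64:])                   # crop to the context window
        probs = torch.softmax(logits[:, -1, :] / 0.8, dim=-1)   # temperature-scaled distribution
        next_token = torch.multinomial(probs, num_samples=1)    # shape (1, 1)
        input_ids = torch.cat([input_ids, next_token], dim=1)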

Lessons Learned

Building Synapse v2 from scratch surfaced insights that pre-built frameworks hide:

  1. Tokenization quality is foundational: bad tokenization forces the model to waste capacity learning token boundaries instead of meaning
  2. Dropout is not optional at small scale: without it, small models memorize training data completely (perplexity 1.05 in v1)
  3. Pre-norm vs post-norm matters: pre-norm enabled stable training at 12 layers; post-norm would have required careful LR tuning
  4. Validation loss is the only truth: training loss is meaningless without a validation signal
  5. Instruction formatting teaches task structure: the model learns what a "question" and an "answer" look like before it learns the content

Synapse Series

| Model | Description | Params |
|---|---|---|
| Synapse v1 | First-principles transformer | 800K |
| Synapse v2 (this model) | Scaled + instruction-tuned | 3.67M |
| Synapse SLM | QLoRA fine-tuned Llama-3.2-3B | 3B |
| Synapse-124M | Custom LLM with GQA, MoE, NTK-RoPE | 124M |

About the Author

Abhinav Tyagi is an LLM Engineer who builds AI systems from the ground up, from custom tokenizers and transformer architectures to production RAG pipelines and agentic systems.

Other work:

  • Synapse-124M β€” 124M parameter transformer from scratch: GQA, MoE, Sliding Window, NTK-RoPE, SwiGLU, custom BPE
  • Synapse Wingman β€” Agentic AI desktop assistant (Telegram β†’ PC control, vision, WhatsApp automation)
  • Smart RAG Chatbot β€” Hybrid RAG with Chain of Verification (CoVe), multi-query generation, FAISS
  • Psywarp β€” Published research on multimodal cognitive AI framework (DOI: 10.5281/zenodo.18182199)

📧 abhinavtyagi5418@gmail.com
🐙 GitHub
💼 LinkedIn


Citation

@misc{tyagi2026synapsev2,
  author = {Tyagi, Abhinav},
  title  = {Synapse v2: A Decoder-Only Transformer Language Model Built from First Principles},
  year   = {2026},
  url    = {https://huggingface.co/Abhinav-Tyagi/synapse_v2}
}

License

MIT: free to use, modify, and distribute with attribution.


"Understanding requires building. Building requires breaking things. Breaking things requires documentation."
– Abhinav Tyagi
