What is BUVN-2.0?

BUVN-2.0 is a 109.5-million-parameter, GPT-style, decoder-only transformer language model built entirely from scratch: no pretrained weights, no fine-tuning shortcuts. It was trained on 2 billion tokens from the C4 dataset on a single NVIDIA H100 NVL GPU in approximately 2 hours.

It is the foundation model of the Beuvian AI Ecosystem — a family of three specialized models:

```
         ╔═══════════════════════════════════╗
         ║  🧠 BUVN-2.0 (Foundation Model)  ║
         ║  109.5M params  |  PPL 29.19     ║
         ╚════════════╦════════════╦════════╝
                      ║            ║
              ╔═══════╩═══╗  ╔════╩════════╗
              ║ 💻 SRVN   ║  ║  📈 MNI     ║
              ║ Code Agent ║  ║  Finance    ║
              ║ (Planned)  ║  ║  (Planned)  ║
              ╚═══════════╝  ╚═════════════╝
```

"Don't just use AI. Understand it. Build it. Own it."


Model Performance

🏆 WikiText-103 Perplexity Leaderboard

| Rank | Model | Organization | Parameters | PPL (↓) | Training Tokens |
|------|-------|--------------|------------|---------|-----------------|
| 1 | LLaMA-2 7B | Meta | 7B | 5.47 | 2T |
| 2 | LLaMA 7B | Meta | 7B | 7.73 | 1T |
| 3 | Pythia-1B | EleutherAI | 1B | 16.71 | 300B |
| 4 | GPT-2 Large | OpenAI | 774M | 19.93 | ~40B |
| 5 | GPT-2 Medium | OpenAI | 355M | 22.76 | ~40B |
| 6 | OPT-125M | Meta | 125M | 27.65 | 300B |
| 7 | RWKV-169M | RWKV | 169M | 29.01 | 300B |
| 8 | 🟢 **BUVN-2.0 (this model)** | Bhuvan | 109.5M | 29.19 | 2B |
| 9 | Pythia-160M | EleutherAI | 160M | 29.33 | 300B |
| 10 | GPT-2 Small | OpenAI | 124M | 29.41 | ~40B |
| 11 | GPT-Neo 125M | EleutherAI | 125M | 32.43 | 300B |

BUVN-2.0 edges out GPT-2 Small with slightly fewer parameters (109.5M vs 124M) and roughly 20x less training data (2B vs ~40B tokens). The architecture is competitive; the remaining gap to higher-ranked models is primarily a matter of scale.


📊 Full Benchmark Results

Quality Metrics

| Metric | Value |
|--------|-------|
| Val Perplexity | 29.19 |
| Train Perplexity | 28.33 |
| Bits Per Token | 4.87 |
| Top-1 Accuracy | 37.88% |
| Top-5 Accuracy | 60.34% |
| Overfit Gap (val − train loss) | 0.03 (healthy) |
| vs Random (32K vocab) | 99.9% better |
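These metrics are simple transforms of the mean cross-entropy loss, so they can be cross-checked against each other. Note that log₂(29.19) ≈ 4.87, so the bit figure is measured per token:

```python
import math

# Perplexity is exp(mean cross-entropy loss in nats);
# bits per token is log2 of perplexity.
val_ppl, train_ppl = 29.19, 28.33

bits = math.log2(val_ppl)
loss_gap = math.log(val_ppl) - math.log(train_ppl)  # val loss - train loss, in nats

print(f"{bits:.2f}")      # 4.87
print(f"{loss_gap:.2f}")  # 0.03
```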

Speed Metrics

| Metric | Value |
|--------|-------|
| Training Throughput | 320,000 tok/s |
| Forward Throughput | 126,976 tok/s |
| Generation Speed | 204 tok/s |
| Generation Latency | 4.9 ms/token |
| MFU (Training) | 24% |
| Peak VRAM | 8.14 GB |
| Training Time | ~2 hours |
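The headline numbers are internally consistent; a quick sanity check using only the values in the table above:

```python
# Consistency check on the reported speed figures.
total_tokens = 2_000_000_000   # 2B training tokens
train_tput   = 320_000         # tok/s training throughput
gen_speed    = 204             # tok/s generation

train_hours = total_tokens / train_tput / 3600
latency_ms  = 1000 / gen_speed

print(f"{train_hours:.2f} h")      # 1.74 h of pure compute, i.e. ~2 h wall clock
print(f"{latency_ms:.1f} ms/tok")  # 4.9 ms/token
```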

📈 Training Progress

Perplexity over Training Steps:

```
  37,600 ┤●
         │ ╲
  10,000 ┤  ╲
         │    ╲
     142 ┤     ●
         │      ╲
      78 ┤       ●
         │        ╲──╲
      55 ┤              ●───╲
         │                    ╲───╲
      42 ┤                         ●───╲
         │                               ╲───╲
      36 ┤                                     ●───╲
         │                                           ╲───●── 29.19 ✅
      29 ┤                                                    Beats GPT-2!
         └──────────────────────────────────────────────────
         0     250   1K    2K    4K    6K    8K   10K   15K
                           Training Steps →
```

Architecture

```mermaid
graph TB
    INPUT["📝 Input Tokens"] --> EMB["Token Embedding<br/>(weight-tied with output)"]
    EMB --> DROP["Dropout"]
    DROP --> TB1["🔲 Transformer Block 1"]
    TB1 --> TB2["🔲 Transformer Block 2"]
    TB2 --> DOTS["⋮ (12 blocks total)"]
    DOTS --> TBN["🔲 Transformer Block 12"]
    TBN --> NORM["RMSNorm (final)"]
    NORM --> OUT["📤 Output Projection → 32K Logits"]

    subgraph TB["Each Transformer Block"]
        direction TB
        A1["RMSNorm"] --> A2["Multi-Head Attention<br/>12 heads × 64 dims + RoPE"]
        A2 --> A3["+ Residual"]
        A3 --> A4["RMSNorm"]
        A4 --> A5["SwiGLU FFN<br/>768 → 2048 → 768"]
        A5 --> A6["+ Residual"]
    end

    style INPUT fill:#0d1117,stroke:#58a6ff,color:#fff
    style OUT fill:#0d1117,stroke:#16c79a,color:#fff
    style TB1 fill:#161b22,stroke:#58a6ff,color:#fff
    style TB2 fill:#161b22,stroke:#58a6ff,color:#fff
    style TBN fill:#161b22,stroke:#58a6ff,color:#fff
    style EMB fill:#161b22,stroke:#bc6ff1,color:#fff
    style NORM fill:#161b22,stroke:#f39c12,color:#fff
```
Model Configuration

| Parameter | Value | Description |
|-----------|-------|-------------|
| d_model | 768 | Embedding dimension |
| n_layers | 12 | Transformer blocks |
| n_heads | 12 | Attention heads |
| head_dim | 64 | Per-head dimension |
| vocab_size | 32,000 | BPE vocabulary |
| max_seq_len | 1,024 | Context window |
| ffn_hidden | 2,048 | SwiGLU hidden dim |
| dropout | 0.0 | No dropout (pre-training) |
| bias | False | No bias terms (LLaMA-style) |
| **Total Params** | **109.53M** | Including embeddings |
| **Non-Embedding** | **84.95M** | Excluding shared embeddings |
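Both totals fall out of the configuration directly. A back-of-envelope check, assuming LLaMA-style blocks (no biases, weight-tied embedding, RMSNorm gain vectors only):

```python
# Parameter count derived from the config table above.
d, n_layers, vocab, ffn = 768, 12, 32_000, 2048

emb    = vocab * d   # token embedding, shared with the output head
attn   = 4 * d * d   # Wq, Wk, Wv, Wo per block
swiglu = 3 * d * ffn # W1, W2, W3 per block
norms  = 2 * d       # two RMSNorm gain vectors per block

total = emb + n_layers * (attn + swiglu + norms) + d  # + final RMSNorm
print(f"{total / 1e6:.2f}M total")            # 109.53M
print(f"{(total - emb) / 1e6:.2f}M non-emb")  # 84.95M
```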

Architecture Highlights

| Component | Choice |
|-----------|--------|
| Position Encoding | RoPE (Rotary) |
| Normalization | RMSNorm (pre-norm) |
| Feedforward | SwiGLU |
| Attention | Flash (SDPA) |
| Weight Tying | Yes (emb = output) |
| Initialization | Depth-scaled residual |

| Design Choice | Why |
|---------------|-----|
| RoPE over absolute | Better generalization, relative positions |
| RMSNorm over LayerNorm | 10-15% faster, same quality |
| SwiGLU over ReLU | 2-3% better PPL via gating |
| No bias | Standard in LLaMA, PaLM |
| Weight tying | Saves 24.6M parameters |
| Pre-norm | More stable training |
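To make two of these choices concrete, here is a minimal NumPy sketch of RMSNorm and a SwiGLU feedforward. It is illustrative only; the model itself implements these in PyTorch inside the repo:

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    # Scale by the root-mean-square of x; no mean-centering or bias, unlike LayerNorm.
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps) * gain

def silu(x):
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w1, w2, w3):
    # SiLU(x @ W1) gates (x @ W3) elementwise, then W2 projects back to d_model.
    return (silu(x @ w1) * (x @ w3)) @ w2

rng = np.random.default_rng(0)
d, hidden = 768, 2048                    # dims from the config above
x = rng.standard_normal((4, d))
w1 = rng.standard_normal((d, hidden)) * 0.02
w3 = rng.standard_normal((d, hidden)) * 0.02
w2 = rng.standard_normal((hidden, d)) * 0.02

y = swiglu_ffn(rms_norm(x, np.ones(d)), w1, w2, w3)
print(y.shape)  # (4, 768)
```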

Parameter Breakdown

```
╔══════════════════════════════════════════════════╗
║  BUVN-2.0 Parameter Distribution                 ║
╠══════════════════════════════════════════════════╣
║                                                  ║
║  Token Embedding     ████████░░░░  24.6M  (22%)  ║
║  (weight-tied)                                   ║
║                                                  ║
║  12× Attention       ██████████░░  28.3M  (26%)  ║
║  (Wq, Wk, Wv, Wo)                                ║
║                                                  ║
║  12× SwiGLU FFN      ████████████  56.6M  (52%)  ║
║  (W1, W2, W3)        ← Most "knowledge" here     ║
║                                                  ║
║  Norms + Other       ░░░░░░░░░░░░   18K   (<1%)  ║
║                                                  ║
║  TOTAL               ████████████ 109.5M (100%)  ║
╚══════════════════════════════════════════════════╝
```

Training Details

Data Pipeline

```
C4 Dataset (HuggingFace)
    │ 8 parallel stream workers (no download, 1.48M tok/s)
    ↓
BPE Tokenizer (32K vocab, trained on 100K samples in 14s)
    │ tokenize in memory
    ↓
Binary files: train.bin (3.8 GB) + val.bin (20 MB)
    │ 2.0 billion tokens total
    ↓
Memory-mapped DataLoader → GPU (zero-copy I/O)
```
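The memory-mapped loader at the bottom of the pipeline can be sketched in a few lines. This is a simplified version under the configuration above; the real loader also moves batches onto the GPU:

```python
import numpy as np

def get_batch(bin_path, batch_size=64, seq_len=1024):
    # np.memmap pages the token file in on demand, so the 3.8 GB train.bin
    # is never fully loaded into RAM. uint16 is enough for a 32K vocab.
    data = np.memmap(bin_path, dtype=np.uint16, mode='r')
    ix = np.random.randint(0, len(data) - seq_len - 1, size=batch_size)
    x = np.stack([data[i : i + seq_len] for i in ix]).astype(np.int64)
    y = np.stack([data[i + 1 : i + 1 + seq_len] for i in ix]).astype(np.int64)
    return x, y  # y is x shifted by one token: the next-token targets
```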

Training Configuration

| Setting | Value |
|---------|-------|
| Optimizer | AdamW |
| Peak LR | 6×10⁻⁴ |
| Min LR | 6×10⁻⁵ |
| Schedule | Cosine decay with 500-step warmup |
| Batch Size | 64 × 2 gradient accumulation = 128 |
| Tokens/Iteration | 131,072 |
| Total Steps | 15,000 |
| Total Tokens | ~2 billion |
| Precision | bfloat16 |
| Compiler | torch.compile (1.5x speedup) |
| Weight Decay | 0.1 |
| Grad Clip | 1.0 |
| Beta1 / Beta2 | 0.9 / 0.95 |
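The schedule row can be sketched as follows; the exact warmup indexing convention (`step + 1`) is an assumption, but the peak, floor, and shape match the table:

```python
import math

def lr_at(step, max_lr=6e-4, min_lr=6e-5, warmup=500, total_steps=15_000):
    # Linear warmup to the peak LR over the first 500 steps ...
    if step < warmup:
        return max_lr * (step + 1) / warmup
    # ... then cosine decay from max_lr down to min_lr.
    progress = (step - warmup) / (total_steps - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_at(499))     # 0.0006 (peak, end of warmup)
print(lr_at(15_000))  # 6e-05 (floor, final step)
```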

Hardware

| Component | Spec |
|-----------|------|
| GPU | NVIDIA H100 NVL (96 GB VRAM) |
| CPU | AMD EPYC 9V84 96-Core (40 vCPUs) |
| RAM | 314 GB |
| PyTorch | 2.9.1 + CUDA 12.8 |

Usage

Download and Run

```bash
# 1. Clone the repo and install dependencies
git clone https://github.com/bhuvan0808/beuvian.git
cd beuvian/BUVN-1.1
pip install -r requirements.txt

# 2. Download weights from this HuggingFace repo
python scripts/load_from_hub.py

# 3. Generate text
python inference/generate.py \
    --prompt "The future of artificial intelligence" \
    --checkpoint checkpoints/buvn_2.0_best.pt \
    --tokenizer tokenizer/tokenizer_32k.json \
    --max_new_tokens 150 \
    --temperature 0.7 \
    --top_k 50
```

Load in Python

```python
import torch
from model.config import BUVNConfig
from model.model import BUVNModel

# Load checkpoint
ckpt = torch.load('buvn_2.0_best.pt', map_location='cuda', weights_only=False)

# Strip the '_orig_mod.' prefix that torch.compile adds to state-dict keys
state_dict = ckpt['model']
for k in list(state_dict.keys()):
    if k.startswith('_orig_mod.'):
        state_dict[k[len('_orig_mod.'):]] = state_dict.pop(k)

# Build model
config = BUVNConfig.from_dict(ckpt['model_args'])
model = BUVNModel(config).cuda()
model.load_state_dict(state_dict)
model.eval()

# Generate (`tokenizer` must be loaded separately, from tokenizer/tokenizer_32k.json)
from inference.sample import generate
text, usage = generate(model, tokenizer, "Your prompt here",
                       max_new_tokens=100, temperature=0.7, top_k=50, device='cuda')
print(text)
```

API Server

```bash
python api/app.py \
    --checkpoint checkpoints/buvn_2.0_best.pt \
    --tokenizer tokenizer/tokenizer_32k.json \
    --port 8000

# Test with curl:
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "The history of science", "max_tokens": 100, "temperature": 0.7}'
```

Sampling Parameters

| Parameter | Range | Default | Effect |
|-----------|-------|---------|--------|
| temperature | 0.0 – 2.0 | 0.7 | 0 = deterministic, higher = more creative |
| top_k | 0 – 32000 | 50 | Only sample from top K most likely tokens |
| top_p | 0.0 – 1.0 | | Nucleus sampling (adaptive token filtering) |
| max_tokens | 1 – 1024 | 100 | Maximum generation length |
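How these knobs interact can be sketched with NumPy. This is a simplified stand-alone sampler; the repo's actual filter order and implementation details are assumptions:

```python
import numpy as np

def sample_next(logits, temperature=0.7, top_k=50, top_p=None, rng=None):
    # Simplified token sampler: temperature, then top-k, then optional top-p.
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)
    if temperature == 0.0:
        return int(np.argmax(logits))          # deterministic (greedy)
    logits = logits / temperature
    if top_k and top_k < len(logits):
        kth = np.sort(logits)[-top_k]          # k-th largest logit
        logits = np.where(logits < kth, -np.inf, logits)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    if top_p is not None:                      # nucleus filtering
        order = np.argsort(probs)[::-1]
        keep = np.cumsum(probs[order]) <= top_p
        keep[0] = True                         # always keep the top token
        mask = np.zeros(len(probs), dtype=bool)
        mask[order[keep]] = True
        probs = np.where(mask, probs, 0.0)
        probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```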

Sample Outputs

**Prompt:** "The history of artificial intelligence began"

> The number of people living with heart disease in the United States is projected to increase by nearly 20 million every year, according to the Centers for Disease Control and Prevention. The Centers for Disease Control and Prevention (CDC) created the National Heart Disease Prevention and Control Program in 2007, the American Heart Association (AHA) released its findings on March 25, 2018...

**Prompt:** "The president of the United States announced"

> Here at The Ritz and Suites, we are proud to offer a variety of unique and unique packages. Our experienced staff is here to help you find the perfect vacation, getaway or special event. Treat yourself to a luxurious vacation in the comfort of your own home!

**Prompt:** "In a groundbreaking study published today"

> If you are having a dental emergency, you may be wondering how to get the most out of your dental treatment, right? Well, that's where the dental implant comes in. The dental implant is the most extensive prosthetic bone in the world...

Note: The model generates fluent, grammatically correct web-text. It does not follow prompt topics because it has not been instruction-tuned yet. This is expected behavior for a foundation model. Instruction tuning (SFT) is the planned next step.


The Beuvian Ecosystem

```mermaid
graph LR
    A["📚 Raw Text<br/>C4 (2B tokens)"] -->|Pre-training| B["🧠 BUVN-2.0<br/>Foundation"]
    B -->|Fine-tune on Code| C["💻 SRVN<br/>Code Agent"]
    B -->|Train on Markets| D["📈 MNI<br/>Finance"]

    style A fill:#1a1a2e,stroke:#16c79a,color:#fff
    style B fill:#0d1117,stroke:#58a6ff,color:#fff,stroke-width:3px
    style C fill:#0d1117,stroke:#f39c12,color:#fff,stroke-width:2px
    style D fill:#0d1117,stroke:#bc6ff1,color:#fff,stroke-width:2px
```

| Model | Role | Status | Description |
|-------|------|--------|-------------|
| 🧠 BUVN | Foundation | Released | General language model — the base for everything |
| 💻 SRVN | Code Agent | 🔜 Planned | Fine-tuned on code (The Stack v2), agentic workflows |
| 📈 MNI | Finance | 🔜 Planned | Trained on market data, SEC filings, sentiment analysis |

Roadmap

  • ✅ BUVN-1.1 — 13.7M params, WikiText-103, PPL 35.87
  • ✅ BUVN-2.0 — 109.5M params, C4 2B tokens, PPL 29.19 (beats GPT-2 Small!)
  • 🔜 Instruction Tuning (SFT) on OpenAssistant + Alpaca
  • 🔜 SRVN — Code agent fine-tuning
  • 🔜 MNI — Finance model training
  • 📋 RLHF / DPO alignment
  • 📋 Chat UI deployment
  • 📋 HuggingFace Spaces demo

Files in This Repository

| File | Size | Description |
|------|------|-------------|
| buvn_2.0_best.pt | 1.31 GB | Model checkpoint (109.5M params, trained 15K steps) |
| tokenizer_32k.json | 2.2 MB | 32K BPE tokenizer (byte-level, trained on C4) |
| config.json | ~200 B | Model hyperparameters |
| README.md | | This model card |

Citation

```bibtex
@misc{buvn2026,
  title={BUVN-2.0: A Foundation Language Model Built From Scratch},
  author={Bhuvan},
  year={2026},
  url={https://huggingface.co/bhuvan0808/buvn-2.0},
  note={109.5M parameter decoder-only transformer, PPL 29.19 on WikiText-103}
}
```

Links

| Resource | URL |
|----------|-----|
| 🐙 GitHub | bhuvan0808/beuvian |
| 📘 Documentation | docs/ |
| 🤗 HuggingFace | bhuvan0808/buvn-2.0 |

Built with ❤️ by Bhuvan

BUVN-2.0 — Part of the Beuvian AI Ecosystem
