HawkGPT v0.5

Russian-language GPT-style transformer language model (24M params) trained from scratch on synthetic Q&A data.

Architecture

| Param | Value |
|---|---|
| Embed dim | 512 |
| Layers | 8 |
| Query heads | 8 |
| KV heads (GQA) | 2 |
| FF dim | 2048 |
| Vocab size | ~3200 (BPE) |
| Max seq len | 256 |
| Parameters | 24,384,000 |
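
To make the attention rows of the table concrete, the sketch below derives the shapes they imply. It assumes the head dimension is the embed dim divided by the number of query heads (512 / 8 = 64), which the table does not state. The function name `gqa_shapes` is hypothetical and is not part of model.py.

```python
def gqa_shapes(embed_dim=512, q_heads=8, kv_heads=2):
    """Derive attention shapes for grouped query attention.

    Assumption (not stated in the table): head_dim = embed_dim // q_heads.
    """
    if embed_dim % q_heads or q_heads % kv_heads:
        raise ValueError("embed_dim must divide by q_heads, and q_heads by kv_heads")
    head_dim = embed_dim // q_heads
    return {
        "head_dim": head_dim,
        # How many query heads attend through the same K/V head
        "queries_per_kv_head": q_heads // kv_heads,
        # Output widths of the Q and the K (or V) projections
        "q_proj_width": q_heads * head_dim,
        "kv_proj_width": kv_heads * head_dim,
        # K/V values per token per layer, relative to full multi-head attention
        "kv_size_ratio": kv_heads / q_heads,
    }

shapes = gqa_shapes()
print(shapes)
```

Under that assumption, each K and V projection is a quarter the width of the Q projection, which is where the inference saving of GQA comes from.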

Key design choices:

  • Grouped Query Attention (GQA): 8 query heads share 2 KV heads (4 query heads per KV head) for faster inference
  • ALiBi: linear position biases added to attention scores instead of learned position embeddings (designed to extrapolate to longer sequences)
  • RMSNorm: normalizes by the root mean square only, skipping the mean subtraction that LayerNorm performs
  • No bias terms: in all Linear layers
  • Weight tying: embedding and output projection share weights
  • BPE tokenizer: digit-aware (each digit is its own token), vocab ~3200
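
Two of the choices above, ALiBi and RMSNorm, are small enough to sketch in plain Python. This is an illustration of the general techniques, not the code in model.py. The slope formula is the one from the ALiBi paper for a power-of-two head count, the causal mask is folded into the bias for brevity, and the `eps` value and all function names are assumptions.

```python
import math

def alibi_slopes(n_heads):
    """Per-head slopes as in the ALiBi paper; assumes n_heads is a power of two."""
    return [2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)]

def alibi_bias(slope, seq_len):
    """Bias for one head: -slope * distance to earlier keys, -inf for future keys."""
    return [
        [-slope * (q - k) if k <= q else float("-inf") for k in range(seq_len)]
        for q in range(seq_len)
    ]

def rms_norm(x, gain=None, eps=1e-6):
    """Divide x by its root mean square; no mean is subtracted (eps is an assumed value)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    gain = gain or [1.0] * len(x)
    return [g * v / rms for g, v in zip(gain, x)]

slopes = alibi_slopes(8)         # one slope per query head
bias = alibi_bias(slopes[0], 4)  # added to the attention scores before softmax
normed = rms_norm([3.0, 4.0])
print(slopes[0], bias[2], normed)
```

Because the bias depends only on the distance between query and key, nothing in it is tied to a fixed maximum length, which is the basis for the extrapolation claim.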

Training

  • Mixed precision (bfloat16) with XLA JIT compilation
  • AdamW optimizer, cosine LR schedule with 1000-step warmup
  • EMA (exponential moving average) of weights
  • Batch size 96, max 30 epochs (early stopping patience 10)
  • Trained on NVIDIA RTX 4070 12GB
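
The learning-rate schedule and the EMA step listed above can be sketched as follows. Only the 1000-step warmup comes from this README. The peak learning rate, total step count, minimum learning rate, and EMA decay are placeholder values, and the function names are hypothetical (the real settings live in config.py).

```python
import math

def lr_at(step, peak_lr=3e-4, warmup_steps=1000, total_steps=50_000, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr.

    Only warmup_steps=1000 is taken from the README; other defaults are placeholders.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def ema_update(ema_weights, weights, decay=0.999):
    """Return the new average: decay * old average + (1 - decay) * current weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]

for step in (0, 500, 1000, 25_000, 50_000):
    print(step, lr_at(step))

ema = ema_update([0.0, 1.0], [1.0, 1.0], decay=0.9)
print(ema)
```

The EMA copy of the weights is typically the one used for evaluation, since it smooths out step-to-step noise in the raw weights.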

Training history

| Epoch | Loss | Throughput |
|---|---|---|
| 1 | 0.0663 | 57K t/s |
| 5 | 0.0520 | 157K t/s |
| 10 | 0.0512 | 360K t/s |
| 13 (best) | 0.0479 | 153K t/s |

Benchmark

Overall: 40/72 (55.6%)

| Category | Score |
|---|---|
| Division | 90% |
| Knowledge | 80% |
| Algebra | 75% |
| Addition | 60% |
| Multiplication | 60% |
| Multi-step | 50% |
| Subtraction | 40% |
| Word problems | 33% |
| Sequences | 20% |

Dataset

Synthetic Russian Q&A corpus (~200K+ pairs, ~80M+ characters) covering:

  • Arithmetic (add, sub, mul, div, multi-step)
  • Algebra (linear, quadratic, systems)
  • Sequences, geometry, physics
  • Python code tracing
  • General knowledge (science, history, geography)
  • Dialogue & conversations

Usage

import tensorflow as tf
from tokenizers import Tokenizer

# Load tokenizer
tokenizer = Tokenizer.from_file("tokenizer.json")
tokenizer.no_padding()
tokenizer.no_truncation()

# Build & load model
from model import build_model
model = build_model(vocab_size=tokenizer.get_vocab_size())
model.load_weights("model_best.weights.h5")

# Generate
def generate(prompt, temperature=0.7, top_k=50, max_new=200):
    bos_id = tokenizer.token_to_id("[BOS]")
    eos_id = tokenizer.token_to_id("[EOS]")
    pad_id = tokenizer.token_to_id("[PAD]")
    ids = [bos_id] + tokenizer.encode(prompt).ids
    prompt_len = len(ids)
    for _ in range(max_new):
        # Keep only the last 256 tokens (the model's max sequence length)
        ctx = tf.constant([ids[-256:]], dtype=tf.int32)
        # Next-token logits, cast to float32 and scaled by temperature
        logits = tf.cast(model(ctx, training=False)[0, -1, :], tf.float32) / temperature
        if top_k:
            # Mask every logit below the k-th largest one
            vals, _ = tf.math.top_k(logits, k=top_k)
            logits = tf.where(logits < vals[-1], -1e9, logits)
        # tf.random.categorical expects unnormalized logits, not softmax probabilities
        next_id = int(tf.random.categorical(logits[None], 1)[0, 0])
        if next_id in (eos_id, pad_id):
            break
        ids.append(next_id)
    # Decode only the newly generated tokens
    return tokenizer.decode(ids[prompt_len:])

print(generate("Вопрос: 2 + 2 ="))
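
The sampling step inside `generate` can also be illustrated without TensorFlow. The sketch below applies temperature scaling and top-k filtering to a plain list of logits. It mirrors the idea rather than the exact TensorFlow behavior (for example, it keeps exactly k ids even when logits are tied), and `sample_top_k` is a hypothetical helper, not part of generate.py.

```python
import math
import random

def sample_top_k(logits, temperature=0.7, top_k=50, rng=random):
    """Sample one token id from raw logits using temperature and top-k filtering."""
    scaled = [x / temperature for x in logits]
    # Keep the k highest-scoring ids (all of them if top_k is 0 or too large)
    k = min(top_k, len(scaled)) if top_k else len(scaled)
    kept = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:k]
    # Softmax weights over the kept logits, shifted by the max for numerical stability
    peak = scaled[kept[0]]
    weights = [math.exp(scaled[i] - peak) for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [sample_top_k([2.0, 1.0, 0.5, -1.0], top_k=2, rng=rng) for _ in range(100)]
print(set(draws))  # only ids from the two highest logits can appear
```

Lower temperatures sharpen the distribution toward the highest logit, which is why the CLI example for arithmetic uses a smaller value than the default.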

CLI

python3 generate.py --prompt "Вопрос: Сколько будет 5 * 7?" --temperature 0.3 --top_k 20

Files

| File | Description |
|---|---|
| model_best.weights.h5 | Best checkpoint weights (94 MB) |
| tokenizer.json | BPE tokenizer |
| config.py | Full model & training config |
| model.py | Model definition (GQA, RMSNorm, ALiBi) |
| generate.py | Inference script |

License

MIT
