---
language: ru
license: mit
library_name: keras
tags:
- gpt
- russian
- transformer
- gqa
- alibi
- rmsnorm
pipeline_tag: text-generation
datasets:
- HawkLabofficial/HawkGPT-v0.5  # synthetic
metrics:
- accuracy
---

# HawkGPT v0.5

Russian-language GPT-style transformer language model (24M parameters), trained from scratch on synthetic Q&A data.

## Architecture

| Param | Value |
|-------|-------|
| Embed dim | 512 |
| Layers | 8 |
| Query heads | 8 |
| KV heads (GQA) | 2 |
| FF dim | 2048 |
| Vocab size | ~3200 (BPE) |
| Max seq len | 256 |
| Parameters | 24,384,000 |

**Key design choices:**

- **Grouped Query Attention (GQA)**: 8 query heads share 2 KV heads (4 query heads per KV head), which shrinks the KV cache and speeds up inference
- **ALiBi**: linear position biases added to the attention scores instead of learned position embeddings, intended to extrapolate to sequences longer than those seen in training
- **RMSNorm**: normalization by root mean square only, skipping the mean subtraction that LayerNorm performs
- **No bias terms**: none of the linear layers has a bias
- **Weight tying**: the embedding and the output projection share weights
- **BPE tokenizer**: digit-aware (each digit is its own token), vocab ~3200

## Training

- Mixed precision (bfloat16) with XLA JIT compilation
- AdamW optimizer, cosine LR schedule with 1000-step warmup
- EMA (exponential moving average) of weights
- Batch size 96, at most 30 epochs (early stopping with patience 10)
- Trained on an NVIDIA RTX 4070 (12 GB)

### Training history

| Epoch | Loss | Throughput |
|-------|------|------------|
| 1 | 0.0663 | 57K t/s |
| 5 | 0.0520 | 157K t/s |
| 10 | 0.0512 | 360K t/s |
| 13 (best) | **0.0479** | 153K t/s |

## Benchmark

**Overall: 40/72 (55.6%)**

| Category | Score |
|----------|-------|
| Division | 90% |
| Knowledge | 80% |
| Algebra | 75% |
| Addition | 60% |
| Multiplication | 60% |
| Multi-step | 50% |
| Subtraction | 40% |
| Word problems | 33% |
| Sequences | 20% |

## Dataset

Synthetic Russian Q&A corpus (200K+ pairs, 80M+ characters) covering:

- Arithmetic (add, sub, mul, div, multi-step)
- Algebra (linear, quadratic, systems)
- Sequences, geometry, physics
- Python code tracing
- General knowledge (science, history, geography)
- Dialogue & conversations

## Usage

```python
import tensorflow as tf
from tokenizers import Tokenizer

from model import build_model  # model.py from this repository

# Load tokenizer
tokenizer = Tokenizer.from_file("tokenizer.json")
tokenizer.no_padding()
tokenizer.no_truncation()

# Build & load model
model = build_model(vocab_size=tokenizer.get_vocab_size())
model.load_weights("model_best.weights.h5")

# Special token IDs, looked up once
bos_id = tokenizer.token_to_id("[BOS]")
eos_id = tokenizer.token_to_id("[EOS]")
pad_id = tokenizer.token_to_id("[PAD]")


# Generate
def generate(prompt, temperature=0.7, top_k=50, max_new=200):
    ids = [bos_id] + tokenizer.encode(prompt).ids
    prompt_len = len(ids)
    for _ in range(max_new):
        # Keep only the last 256 tokens (the model's max sequence length)
        ctx = tf.constant([ids[-256:]], dtype=tf.int32)
        # Cast to float32 so sampling is stable under mixed precision
        logits = tf.cast(model(ctx, training=False)[0, -1, :], tf.float32) / temperature
        if top_k:
            # Mask everything below the k-th largest logit
            vals, _ = tf.math.top_k(logits, k=top_k)
            logits = tf.where(logits < vals[-1], -1e9, logits)
        # tf.random.categorical expects unnormalized logits, not probabilities
        next_id = int(tf.random.categorical(logits[None], 1)[0, 0])
        if next_id in (eos_id, pad_id):
            break
        ids.append(next_id)
    # Return only the newly generated tokens, without the prompt
    return tokenizer.decode(ids[prompt_len:])


print(generate("Вопрос: 2 + 2 ="))  # prompt means "Question: 2 + 2 ="
```

### CLI

```bash
# The prompt means "Question: What is 5 * 7?"
python3 generate.py --prompt "Вопрос: Сколько будет 5 * 7?" --temperature 0.3 --top_k 20
```

## Files

| File | Description |
|------|-------------|
| `model_best.weights.h5` | Best checkpoint weights (94 MB) |
| `tokenizer.json` | BPE tokenizer |
| `config.py` | Full model & training config |
| `model.py` | Model definition (GQA, RMSNorm, ALiBi) |
| `generate.py` | Inference script |

## License

MIT
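
## Appendix: ALiBi and GQA sketch

The ALiBi and GQA choices listed under Architecture can be illustrated with a small standalone sketch in plain Python. It is not taken from `model.py`: the function names are hypothetical, the slope schedule is the standard geometric one from the ALiBi paper for a power-of-two head count, and the assignment of consecutive query heads to one KV head is an assumption about how the 8 query / 2 KV heads are grouped.

```python
import math


def alibi_slopes(num_heads):
    """Geometric per-head slopes (assumed schedule, power-of-two head count)."""
    # For n heads the slopes are 2^(-8/n), 2^(-16/n), ..., 2^(-8).
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]


def alibi_bias(num_heads, seq_len):
    """Bias added to attention scores: slope * (key_pos - query_pos).

    Future keys (key_pos > query_pos) get -inf, which acts as the causal mask.
    """
    return [
        [
            [m * (j - i) if j <= i else -math.inf for j in range(seq_len)]
            for i in range(seq_len)
        ]
        for m in alibi_slopes(num_heads)
    ]


def kv_head_for_query_head(q_head, num_q_heads=8, num_kv_heads=2):
    """GQA grouping (assumed): consecutive query heads share one KV head."""
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size


slopes = alibi_slopes(8)
bias = alibi_bias(8, 4)
groups = [kv_head_for_query_head(h) for h in range(8)]

print(slopes[0], slopes[-1])  # 0.5 0.00390625
print(bias[0][3])             # [-1.5, -1.0, -0.5, 0.0]
print(groups)                 # [0, 0, 0, 0, 1, 1, 1, 1]
```

Because the bias depends only on the distance between query and key positions, nothing in it is tied to the 256-token training length, which is the property the extrapolation claim relies on.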