	---
	language: ru
	license: mit
	library_name: keras
	tags:
	- gpt
	- russian
	- transformer
	- gqa
	- alibi
	- rmsnorm
	pipeline_tag: text-generation
	datasets:
	- HawkLabofficial/HawkGPT-v0.5 # synthetic
	metrics:
	- accuracy
	---

	# HawkGPT v0.5

	Russian-language GPT-style transformer language model (24M params) trained from scratch on synthetic Q&A data.

	## Architecture

	| Param | Value |
	|-------|-------|
	| Embed dim | 512 |
	| Layers | 8 |
	| Query heads | 8 |
	| KV heads (GQA) | 2 |
	| FF dim | 2048 |
	| Vocab size | ~3200 (BPE) |
	| Max seq len | 256 |
	| Parameters | 24,384,000 |

	Key design choices:
	- Grouped Query Attention (GQA): 8 query heads share 2 KV heads (4 query heads per KV head), which shrinks the KV projections for faster inference
	- ALiBi: linear position biases on the attention scores instead of learned position embeddings (extrapolates to longer sequences)
	- RMSNorm: normalization by the root mean square only, which is faster because no mean is computed
	- No bias terms in any linear layer
	- Weight tying: the token embedding and the output projection share weights
	- BPE tokenizer: digit-aware (each digit is its own token), vocab ~3200
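
The first three design choices can be illustrated with a minimal pure-Python sketch. It is not the code in `model.py`: the function names are hypothetical, the assignment of consecutive query heads to one KV head is an assumption about the grouping order, and the slope formula is the standard geometric sequence from the ALiBi paper rather than something this card states.

```python
import math

N_QUERY_HEADS = 8  # from the architecture table
N_KV_HEADS = 2     # from the architecture table


def kv_head_for(query_head, n_q=N_QUERY_HEADS, n_kv=N_KV_HEADS):
    """GQA: each group of n_q // n_kv consecutive query heads shares one KV head.

    The consecutive grouping is an assumption, not taken from model.py.
    """
    return query_head // (n_q // n_kv)


def alibi_slopes(n_heads):
    """Geometric per-head slopes from the ALiBi paper (n_heads a power of two)."""
    return [2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)]


def alibi_bias(seq_len, slope):
    """Causal bias: -slope * distance for visible keys, -inf for future keys."""
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]


def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: divide by the root mean square, then apply a learned gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]


print([kv_head_for(h) for h in range(N_QUERY_HEADS)])
print(alibi_slopes(N_QUERY_HEADS)[:3])
print(alibi_bias(3, 0.5))
print(rms_norm([3.0, 4.0], gain=[1.0, 1.0]))
```

The bias matrix is added to the attention scores before the softmax, so more distant keys are penalized linearly and no position embedding table is needed.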

	## Training

	- Mixed precision (bfloat16) with XLA JIT compilation
	- AdamW optimizer, cosine LR schedule with 1000-step warmup
	- EMA (exponential moving average) of weights
	- Batch size 96, max 30 epochs (early stopping patience 10)
	- Trained on NVIDIA RTX 4070 12GB
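
As a rough illustration of the schedule and the EMA described above, here is a sketch in plain Python. Only the 1000-step warmup comes from this card; `PEAK_LR`, `TOTAL_STEPS`, and `EMA_DECAY` are hypothetical placeholder values, and the actual settings live in `config.py`.

```python
import math

WARMUP_STEPS = 1000  # from the training notes above
PEAK_LR = 3e-4       # hypothetical value, not stated in this card
TOTAL_STEPS = 20000  # hypothetical value, not stated in this card
EMA_DECAY = 0.999    # hypothetical value, not stated in this card


def learning_rate(step):
    """Linear warmup to PEAK_LR, then cosine decay to zero at TOTAL_STEPS."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))


def ema_update(shadow, weights, decay=EMA_DECAY):
    """Move each shadow weight a small step toward the current weight."""
    return [decay * s + (1.0 - decay) * w for s, w in zip(shadow, weights)]


for step in (0, 999, 1000, 10500, 20000):
    print(step, learning_rate(step))
```

The EMA copy of the weights changes more smoothly than the raw weights, which is why it is commonly the copy used for evaluation and checkpointing.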

	### Training history

	| Epoch | Loss | Throughput |
	|-------|------|------------|
	| 1 | 0.0663 | 57K t/s |
	| 5 | 0.0520 | 157K t/s |
	| 10 | 0.0512 | 360K t/s |
	| 13 (best) | 0.0479 | 153K t/s |

	## Benchmark

	Overall: 40/72 (55.6%)

	| Category | Score |
	|----------|-------|
	| Division | 90% |
	| Knowledge | 80% |
	| Algebra | 75% |
	| Addition | 60% |
	| Multiplication | 60% |
	| Multi-step | 50% |
	| Subtraction | 40% |
	| Word problems | 33% |
	| Sequences | 20% |

	## Dataset

	Synthetic Russian Q&A corpus (200K+ pairs, 80M+ characters) covering:
	- Arithmetic (add, sub, mul, div, multi-step)
	- Algebra (linear, quadratic, systems)
	- Sequences, geometry, physics
	- Python code tracing
	- General knowledge (science, history, geography)
	- Dialogue & conversations
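
To give a feel for what a synthetic arithmetic pair could look like, here is a hypothetical generator. It is not the script that produced this dataset: the question wording is borrowed from the CLI example in this card, while the `Ответ:` answer prefix, the operand ranges, and the function name are assumptions made for illustration.

```python
import random


def make_arithmetic_pair(rng):
    """Return one (question, answer) pair for a random integer arithmetic task."""
    op = rng.choice(["+", "-", "*", "/"])
    b = rng.randint(1, 99)
    if op == "/":
        # Build the dividend from the quotient so the division is exact
        answer = rng.randint(1, 99)
        a = answer * b
    else:
        a = rng.randint(1, 99)
        answer = {"+": a + b, "-": a - b, "*": a * b}[op]
    question = f"Вопрос: Сколько будет {a} {op} {b}?"
    # The "Ответ:" prefix is an assumed format, not confirmed by this card
    return question, f"Ответ: {answer}"


rng = random.Random(0)  # fixed seed so the sample is reproducible
for _ in range(3):
    print(*make_arithmetic_pair(rng))
```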

	## Usage

	```python
	import tensorflow as tf
	from tokenizers import Tokenizer

	from model import build_model  # model.py from this repository

	MAX_SEQ_LEN = 256  # context window from the architecture table

	# Load tokenizer
	tokenizer = Tokenizer.from_file("tokenizer.json")
	tokenizer.no_padding()
	tokenizer.no_truncation()

	# Build & load model
	model = build_model(vocab_size=tokenizer.get_vocab_size())
	model.load_weights("model_best.weights.h5")

	# Special token ids, looked up once
	bos_id = tokenizer.token_to_id("[BOS]")
	eos_id = tokenizer.token_to_id("[EOS]")
	pad_id = tokenizer.token_to_id("[PAD]")

	# Generate
	def generate(prompt, temperature=0.7, top_k=50, max_new=200):
	    ids = [bos_id] + tokenizer.encode(prompt).ids
	    prompt_len = len(ids)
	    for _ in range(max_new):
	        # Feed at most the last MAX_SEQ_LEN tokens
	        ctx = tf.constant([ids[-MAX_SEQ_LEN:]], dtype=tf.int32)
	        # Last-position logits, cast to float32 in case the model outputs bfloat16
	        logits = tf.cast(model(ctx, training=False)[0, -1, :], tf.float32) / temperature
	        if top_k:
	            # Mask everything below the k-th largest logit
	            vals, _ = tf.math.top_k(logits, k=top_k)
	            logits = tf.where(logits < vals[-1], -1e9, logits)
	        # tf.random.categorical expects unnormalized logits, not probabilities
	        next_id = int(tf.random.categorical(logits[None], 1)[0, 0])
	        if next_id in (eos_id, pad_id):
	            break
	        ids.append(next_id)
	    # Decode only the newly generated tokens
	    return tokenizer.decode(ids[prompt_len:])

	print(generate("Вопрос: 2 + 2 ="))
	```
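
The sampling step inside `generate` can be tried without the model weights. The sketch below reimplements temperature scaling and top-k sampling with the standard library only; `sample_top_k` is a hypothetical helper and the logits are made-up values, so it illustrates the technique rather than reproducing the model's output.

```python
import math
import random


def sample_top_k(logits, temperature=0.7, top_k=50, rng=random):
    """Sample a token id from the top_k highest logits after temperature scaling."""
    scaled = [x / temperature for x in logits]
    # Keep the indices of the top_k largest scaled logits
    keep = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax weights over the kept logits, shifted by the max for stability
    peak = scaled[keep[0]]
    weights = [math.exp(scaled[i] - peak) for i in keep]
    return rng.choices(keep, weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # made-up values for illustration
rng = random.Random(0)
draws = [sample_top_k(toy_logits, temperature=0.7, top_k=2, rng=rng) for _ in range(1000)]
print({i: draws.count(i) for i in sorted(set(draws))})
```

Lower temperatures and smaller `top_k` values concentrate the draws on the highest-scoring tokens, which tends to suit arithmetic prompts; with `top_k=1` the sampler always returns the highest-scoring token.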

	### CLI
	```bash
	python3 generate.py --prompt "Вопрос: Сколько будет 5 * 7?" --temperature 0.3 --top_k 20
	```

	## Files

	| File | Description |
	|------|-------------|
	| `model_best.weights.h5` | Best checkpoint weights (94 MB) |
	| `tokenizer.json` | BPE tokenizer |
	| `config.py` | Full model & training config |
	| `model.py` | Model definition (GQA, RMSNorm, ALiBi) |
	| `generate.py` | Inference script |

	## License

	MIT