Russian Jokes Language Model

A small Transformer language model trained to generate Russian jokes.

Architecture

Transformer decoder with modern components:

  • Positional embeddings: ALiBi (Attention with Linear Biases)
  • Attention mechanism: Grouped Query Attention (GQA)
  • Feed-Forward block: SwiGLU
  • Normalization: RMSNorm (pre-norm)
  • Tokenizer: Byte-level BPE, vocab_size=1024
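
As an illustration of two of the components listed above, here is a minimal pure-Python sketch of the ALiBi attention bias and of RMSNorm. It is not taken from this repository's code: the function names are hypothetical, the slope formula is the one from the ALiBi paper for power-of-two head counts only (it fits the 4-head nano config, while the 6-head mini config would need the paper's interpolation scheme), and a real implementation would operate on torch tensors rather than Python lists.

```python
import math


def alibi_slopes(num_heads):
    """Per-head slopes 2^(-8/n), 2^(-16/n), ... for a power-of-two head count."""
    if num_heads <= 0 or num_heads & (num_heads - 1) != 0:
        raise ValueError("this sketch only handles power-of-two head counts")
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (i + 1) for i in range(num_heads)]


def alibi_bias(num_heads, seq_len):
    """bias[h][i][j] is added to the attention logit of query i and key j.

    Past keys (j <= i) get a penalty that grows linearly with distance;
    future keys (j > i) are masked with -inf to keep attention causal.
    """
    return [
        [
            [s * (j - i) if j <= i else -math.inf for j in range(seq_len)]
            for i in range(seq_len)
        ]
        for s in alibi_slopes(num_heads)
    ]


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide by the root mean square of x, then apply a learned gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]


if __name__ == "__main__":
    print(alibi_slopes(4))
    print(alibi_bias(4, 3)[0])
    print(rms_norm([3.0, 4.0], [1.0, 1.0]))
```

Because the ALiBi bias depends only on the distance between query and key, no positional vectors are added to the token embeddings. With pre-norm, the normalization is applied to the input of each attention and feed-forward block rather than to its output.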

Model Configurations

Config   Layers   Heads   Hidden dim   Parameters
nano     3        4       96           ~0.5M
mini     6        6       384          ~15M
small    12       12      768          ~85M

Training Results

Model     Learning rate   Steps   Validation loss
nano_1    3e-4            10k     3.424
nano_2    1e-3            15k     3.201
nano_3    5e-4            20k     3.199
mini_1    1e-4            10k     3.070
mini_2    3e-4            15k     2.754
mini_3    5e-5            20k     3.050
small_1   5e-5            10k     2.944

The best model, mini_2 (validation loss 2.754), is uploaded to this repository.
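
To make the comparison concrete, the short sketch below picks the run with the lowest validation loss from the table above and converts that loss to perplexity. It assumes the reported loss is the mean per-token cross-entropy in nats, which the card does not state explicitly, so the perplexity figure should be read as indicative only.

```python
import math

# (run, learning rate, steps, validation loss), copied from the results table
runs = [
    ("nano_1", 3e-4, 10_000, 3.424),
    ("nano_2", 1e-3, 15_000, 3.201),
    ("nano_3", 5e-4, 20_000, 3.199),
    ("mini_1", 1e-4, 10_000, 3.070),
    ("mini_2", 3e-4, 15_000, 2.754),
    ("mini_3", 5e-5, 20_000, 3.050),
    ("small_1", 5e-5, 10_000, 2.944),
]

# Lowest validation loss wins
best = min(runs, key=lambda run: run[3])

# Perplexity = exp(mean cross-entropy), valid only if the loss is in nats per token
perplexity = math.exp(best[3])

print(f"best run: {best[0]}, val loss {best[3]:.3f}, perplexity ~{perplexity:.1f}")
```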

Bonus Experiments

Additional variants using RoPE (rotary positional embeddings) and MLA (Multi-head Latent Attention) were implemented and evaluated; results are in the notebook:

Model             Positional embeddings   Attention
nano baseline     ALiBi                   GQA
nano + RoPE       RoPE                    GQA
nano + MLA        ALiBi                   MLA
nano + RoPE+MLA   RoPE                    MLA
mini baseline     ALiBi                   GQA
mini + RoPE       RoPE                    GQA
mini + MLA        ALiBi                   MLA
mini + RoPE+MLA   RoPE                    MLA
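
For readers unfamiliar with RoPE, the sketch below shows the core idea in pure Python: each consecutive pair of query or key features is rotated by an angle proportional to the token position, so the query-key dot product depends only on the relative offset. This is an illustrative sketch, not the notebook's implementation; the function names, the interleaved pairing, and the base of 10000 are assumptions (some implementations pair the first and second halves of the vector instead).

```python
import math


def apply_rope(vec, position, base=10000.0):
    """Rotate pairs (x0, x1), (x2, x3), ... by position-dependent angles."""
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("RoPE needs an even feature dimension")
    out = []
    for k in range(0, dim, 2):
        # Pair k/2 uses frequency base^(-k/dim): early pairs rotate fastest
        theta = position * base ** (-k / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[k], vec[k + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    q = [1.0, 2.0, 3.0, 4.0]
    k = [0.5, -1.0, 2.0, 0.0]
    # Same relative offset (2) at different absolute positions
    print(dot(apply_rope(q, 5), apply_rope(k, 3)))
    print(dot(apply_rope(q, 12), apply_rope(k, 10)))
```

The two printed scores agree up to floating-point error, which is the relative-position property that lets RoPE stand in for an additive bias such as ALiBi.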

Generation Examples

Prompt: Заходит в бар ("Walks into a bar")

nano:

Заходит в барман. Вдруг космарх, а тот, что ты, а он не дает не снова.

mini (this model):

Заходит в бар и говорит: — Ой, а почему вы, ведь я была в результате медведевой? Пока приезжает: — Во месяцев, сегодня утра, я вторая. — А я не понял, что ли? — Я не мог. Именно в этом мире.

Usage

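import torch

# ByteLevelBPETokenizer and TransformerForCausalLM appear to be this project's own
# classes; import them from wherever they are defined in your copy of the code.
tokenizer = ByteLevelBPETokenizer.from_pretrained("OlyaP12/llm-course-hw1")
model = TransformerForCausalLM.from_pretrained("OlyaP12/llm-course-hw1")

text = "Заходит в бар"
# Drop the last encoded token (presumably an EOS token appended by encode) so the
# model continues the prompt, then add a batch dimension.
input_ids = torch.tensor(tokenizer.encode(text)[:-1])[None, :]
output = model.generate(
    input_ids,
    max_new_tokens=200,
    eos_token_id=tokenizer.eos_token_id,
    do_sample=True,  # sample from the distribution instead of greedy decoding
    top_k=10,        # restrict sampling to the 10 most likely tokens
)
print(tokenizer.decode(output[0].tolist()))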
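
The do_sample=True and top_k=10 arguments above suggest top-k sampling. The following is a minimal pure-Python sketch of that technique for a single decoding step; whether this repository's generate method works exactly this way is an assumption, and sample_top_k is a hypothetical helper, not part of the model's API.

```python
import math
import random


def sample_top_k(logits, k, rng=random):
    """Sample one token id from the k highest-scoring logits.

    Only the top-k candidates keep non-zero probability; their logits are
    turned into softmax weights and one id is drawn at random.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    # Subtract the maximum before exponentiating for numerical stability
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


if __name__ == "__main__":
    logits = [0.1, 5.0, 4.0, -2.0, 3.0]
    rng = random.Random(0)
    print([sample_top_k(logits, 2, rng) for _ in range(10)])
```

With k=1 this reduces to greedy decoding, while a larger k trades coherence for variety in the generated text.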

Dataset

Trained on IgorVolochay/russian_jokes, a collection of about 150k Russian jokes.

Model size: 11.2M parameters (Safetensors, tensor type F32)