ruGPT-3 XL 8k

A 1.3B-parameter GPT-2-style language model for Russian with an extended context window of 8192 tokens, trained via continued pretraining from evilfreelancer/ruGPT3XL.

This is a base (pretrained) model, not instruction-tuned.

Model Details

| Parameter | Value |
|---|---|
| Parameters | 1.3B |
| Architecture | GPT-2 (decoder-only transformer) |
| Hidden size | 2048 |
| Layers | 24 |
| Attention heads | 16 |
| FFN intermediate size | 8192 |
| Max sequence length | 8192 |
| Vocabulary | 50,264 tokens (BPE) |
| Activation | GELU |
| Normalization | Pre-LayerNorm |
| Position encoding | Learned absolute (tiled extension) |
| Attention | Alternating sparse/dense |
| Precision | bfloat16 |
| Base model | evilfreelancer/ruGPT3XL (2048 ctx) |
| Fine-tuning dataset | IlyaGusev/gazeta |

Quick Start

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "evilfreelancer/ruGPT3XL-8k"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

inputs = tokenizer("Москва - столица", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.2,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Context Extension: 2k -> 4k -> 8k

The original ruGPT3XL uses Learned Absolute Positional Embeddings (APE): the position table embed_positions is a plain nn.Embedding(max_position_embeddings, hidden_size) that is trained together with all other weights. This means the model has never seen position indices beyond 2047 and cannot generalize to longer sequences without fine-tuning.

Additionally, the model uses alternating sparse attention where the attention mask grid is built dynamically as num_blocks = max_position_embeddings // sparse_block_size, so increasing max_position_embeddings automatically adjusts the sparse grid without any architectural changes.
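Because the grid size is derived arithmetically from the config, the extension is a pure config change at this level. A minimal sketch of that arithmetic (function name is illustrative):

```python
# Sketch: how the block-sparse attention grid scales with context length,
# using the num_blocks formula described above.
def sparse_grid_blocks(max_position_embeddings: int, sparse_block_size: int = 16) -> int:
    """Number of blocks per side of the block-sparse attention grid."""
    return max_position_embeddings // sparse_block_size

print(sparse_grid_blocks(2048))  # 128 blocks (original model)
print(sparse_grid_blocks(8192))  # 512 blocks (extended model)
```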

Strategy

Context was extended in two steps: 2k -> 4k, then 4k -> 8k, continuing from the previous checkpoint each time.

Step 1 - Positional embedding tiling. At each extension step the existing embedding matrix is kept intact for known positions (0 to N-1), and the new positions N to 2N-1 are filled by copying the current table row by row:

2k -> 4k (copies the original 2048-row table):

position 2048 <- weights of position 0
position 2049 <- weights of position 1
...
position 4095 <- weights of position 2047

4k -> 8k (copies the fine-tuned 4096-row table):

position 4096 <- weights of position 0
...
position 8191 <- weights of position 4095

This is deliberately chosen over linear interpolation: interpolation perturbs all existing embeddings and causes severe perplexity regression on short contexts. Tiling preserves exact weights for positions 0..N-1, so the model does not "forget" how to handle short sequences.
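The tiling step can be sketched in a few lines. Plain Python lists stand in for the `nn.Embedding` weight matrix here; in practice this operates on a torch tensor:

```python
# Sketch of positional-embedding tiling: extend the table by cycling
# through its existing rows, leaving rows 0..N-1 untouched.
def tile_positions(table, new_len):
    """Return a table of new_len rows where rows beyond len(table)
    repeat the existing rows cyclically."""
    old_len = len(table)
    return [table[i % old_len] for i in range(new_len)]

old = [[float(i)] for i in range(2048)]   # toy 2048-row table
new = tile_positions(old, 4096)

assert new[:2048] == old       # short-context weights preserved exactly
assert new[2048] == old[0]     # position 2048 <- position 0
assert new[4095] == old[2047]  # position 4095 <- position 2047
```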

Step 2 - Mixed-length dataset. Training uses a 60/40 mix of long and short examples:

  • Long (60%): multiple news articles from IlyaGusev/gazeta packed together with EOS tokens until reaching the target context length. All packed samples exceed half the target length, ensuring the model is consistently exposed to new position indices.
  • Short (40%): single-article chunks up to half the target length. Prevents forgetting short-context behavior.
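The two sample types can be sketched as follows (function and variable names are hypothetical; real training operates on gazeta token IDs and the tokenizer's actual EOS id):

```python
import random

EOS = 0  # stand-in for the tokenizer's EOS token id

def pack_long(articles, target_len):
    """Concatenate articles with EOS separators until the buffer reaches
    target_len, then truncate to exactly target_len."""
    buf = []
    for tokens in articles:
        buf.extend(tokens + [EOS])
        if len(buf) >= target_len:
            break
    return buf[:target_len]

def make_short(article, target_len):
    """Single-article chunk capped at half the target length."""
    return article[: target_len // 2]

articles = [[random.randint(1, 100) for _ in range(1500)] for _ in range(10)]
long_sample = pack_long(articles, 8192)
short_sample = make_short(articles[0], 8192)
assert len(long_sample) == 8192
assert len(short_sample) <= 4096
```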

Step 3 - Continued pretraining. 3 epochs, lr=5e-6, cosine decay, warmup_steps=50, gradient_checkpointing=True, bfloat16, gradient_accumulation_steps=8, hardware: RTX 4090 (48 GB VRAM).

Note on OOM. Training at 8k context caused CUDA memory fragmentation during backpropagation, crashing at step 517/936 despite ~1 GB of technically free VRAM. Fix: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. After this, peak usage dropped from 46.8 GB to 38.5 GB and training completed without issues.
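The allocator fix is a single environment variable, set before the training process starts (it takes effect when PyTorch initializes its CUDA caching allocator):

```shell
# Enable expandable CUDA memory segments to avoid fragmentation-induced OOM.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```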

Perplexity

Evaluated on the test split of IlyaGusev/gazeta using the non_overlapping chunking strategy, in bfloat16.

Perplexity chart

| Model | PPL @ 2048 | PPL @ 4096 | PPL @ 8192 |
|---|---|---|---|
| ruGPT3XL (baseline) | 11.68 | - | - |
| ruGPT3XL-4k (intermediate) | 11.75 | 12.04 | - |
| ruGPT3XL-8k (this model) | 11.77 | 11.99 | 13.00 |

Regression on the original 2k context is +0.09 PPL - essentially unchanged. The 4k evaluation on the 8k model is slightly better than the intermediate 4k checkpoint (11.99 vs 12.04), indicating that continued pretraining improved overall quality.
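Non-overlapping evaluation splits the token stream into disjoint windows and exponentiates the mean per-token negative log-likelihood. A minimal sketch with toy NLL values (a real run would take per-token NLLs from the model's logits):

```python
import math

def perplexity_non_overlapping(token_nlls, window):
    """Split per-token NLLs into disjoint windows of `window` tokens,
    drop the trailing remainder, and return exp(mean NLL)."""
    n = (len(token_nlls) // window) * window
    chunk = token_nlls[:n]
    return math.exp(sum(chunk) / len(chunk))

# Toy check: a constant per-token NLL of ln(12) yields PPL = 12 exactly.
nlls = [math.log(12.0)] * 10000
print(round(perplexity_non_overlapping(nlls, 2048), 2))  # 12.0
```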

VRAM Requirements (inference, batch=1, bfloat16)

VRAM chart

| Context length | VRAM peak | KV + activations |
|---|---|---|
| 512 | 2.92 GiB | 0.25 GiB |
| 1024 | 3.16 GiB | 0.49 GiB |
| 2048 | 3.86 GiB | 1.19 GiB |
| 4096 | 6.57 GiB | 3.90 GiB |
| 8192 | 15.98 GiB | 13.31 GiB |

Model weights occupy ~2.67 GiB (bfloat16). Overhead from KV cache and activations grows roughly linearly up to ~2k (sparse attention helps) and becomes near-quadratic beyond that. GPUs with 8 GB VRAM are practical up to ~3.5-4k context.
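A back-of-the-envelope estimate for this architecture (24 layers, hidden size 2048, bfloat16 = 2 bytes per element; K and V each cache one hidden-size vector per token per layer) shows the KV cache alone accounts for only a small part of the overhead, so activations dominate the reported numbers:

```python
def kv_cache_bytes(seq_len, n_layers=24, hidden=2048, bytes_per_el=2):
    """KV cache size: 2 tensors (K and V) x layers x seq_len x hidden."""
    return 2 * n_layers * seq_len * hidden * bytes_per_el

gib = kv_cache_bytes(8192) / 2**30
print(f"{gib:.2f} GiB")  # 1.50 GiB at 8192 tokens
```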

Generation Speed (bfloat16, 64 new tokens, batch=1, RTX 4090)

Speed chart

| Prompt length | tok/s | ms / token |
|---|---|---|
| 512 | 1444 | 0.7 |
| 1024 | 882 | 1.1 |
| 2048 | 378 | 2.6 |
| 4096 | 67 | 14.9 |
| 8000 | 38 | 26.6 |

Speed is measured for autoregressive decoding with KV cache. The 2x step from 4k to 8k prompt length causes only ~1.8x slowdown (67 -> 38 tok/s), consistent with the linear scaling expected from sparse attention.

Sparse Attention

Inherited from the base model: even-numbered layers (0, 2, 4, ...) use block-sparse causal attention, odd-numbered layers use standard dense causal attention. The sparse pattern is computed from config.json at model init and does not require DeepSpeed at inference time.

| Parameter | Value |
|---|---|
| sparse_mode | "alternating" |
| sparse_block_size | 16 |
| sparse_num_local_blocks | 8 (local window = 128 tokens) |
| sparse_num_global_blocks | 1 |
| sparse_num_different_global_patterns | 8 |
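A sketch of the block-level causal mask these settings imply (16-token blocks, 8 local blocks, 1 global block). Names are illustrative, and the real pattern additionally varies per head via sparse_num_different_global_patterns:

```python
def block_sparse_mask(num_blocks, num_local_blocks=8, num_global_blocks=1):
    """mask[i][j] is True when query block i may attend to key block j:
    causal (j <= i) AND (inside the local sliding window OR a global block)."""
    mask = [[False] * num_blocks for _ in range(num_blocks)]
    for i in range(num_blocks):
        for j in range(i + 1):                       # causal constraint
            local = i - j < num_local_blocks         # local sliding window
            global_ = j < num_global_blocks          # leading global block(s)
            mask[i][j] = local or global_
    return mask

m = block_sparse_mask(32)
assert m[31][31] and m[31][24]  # local window covers the last 8 blocks
assert m[31][0]                 # global block visible from everywhere
assert not m[31][10]            # outside the window and not global
assert not m[10][11]            # causal: no attending to future blocks
```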

Limitations

  • Base model, not instruction-tuned. Works best for text completion.
  • Primarily Russian text. Limited capability in other languages.
  • Content may be biased, factually incorrect, or offensive - inherited from the original pretraining corpus.
  • At 8k context, inference requires ~16 GB VRAM (bfloat16, batch=1).

Training Details

| Parameter | 2k -> 4k step | 4k -> 8k step |
|---|---|---|
| Base | evilfreelancer/ruGPT3XL | ruGPT3XL-4k (intermediate) |
| Dataset | IlyaGusev/gazeta | IlyaGusev/gazeta |
| Train samples | 2500 (1500 long + 1000 short) | 2500 (1500 long + 1000 short) |
| Val samples | 250 | 250 |
| Packed length | 4096 | 8192 |
| Short max length | 2048 | 4096 |
| Epochs | 3 | 3 |
| Learning rate | 5e-6 | 5e-6 |
| LR scheduler | cosine | cosine |
| Warmup steps | 50 | 50 |
| Batch size (effective) | 8 | 8 |
| Optimizer | AdamW fused | AdamW fused |
| Precision | bfloat16 | bfloat16 |
| Hardware | RTX 4090 48 GB | RTX 4090 48 GB |
| Training time | ~2.6 h | ~3.9 h (incl. resume) |

Citation

```bibtex
@misc{rugpt3xl_8k,
  title={ruGPT-3 XL 8k - extended context window via positional embedding tiling},
  author={Pavel Rykov},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/evilfreelancer/ruGPT3XL-8k}
}
```
