# ruGPT-3 XL 8k
A 1.3B-parameter GPT-2-style language model for Russian with an extended context window of 8192 tokens, trained via continued pretraining from evilfreelancer/ruGPT3XL.
This is a base (pretrained) model, not instruction-tuned.
## Model Details
| Parameter | Value |
|---|---|
| Parameters | 1.3B |
| Architecture | GPT-2 (decoder-only transformer) |
| Hidden size | 2048 |
| Layers | 24 |
| Attention heads | 16 |
| FFN intermediate size | 8192 |
| Max sequence length | 8192 |
| Vocabulary | 50,264 tokens (BPE) |
| Activation | GELU |
| Normalization | Pre-LayerNorm |
| Position encoding | Learned absolute (tiled extension) |
| Attention | Alternating sparse/dense |
| Precision | bfloat16 |
| Base model | evilfreelancer/ruGPT3XL (2048 ctx) |
| Fine-tuning dataset | IlyaGusev/gazeta |
## Quick Start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "evilfreelancer/ruGPT3XL-8k"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype="bfloat16",
    device_map="auto",
)

inputs = tokenizer("Москва - столица", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.2,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Context Extension: 2k -> 4k -> 8k
The original ruGPT3XL uses learned absolute positional embeddings (APE): the position
table `embed_positions` is a plain `nn.Embedding(max_position_embeddings, hidden_size)`
trained together with all other weights. The model has therefore never seen position
indices beyond 2047 and cannot generalize to longer sequences without fine-tuning.
Additionally, the model uses alternating sparse attention whose attention mask
grid is built dynamically as `num_blocks = max_position_embeddings // sparse_block_size`, so
increasing `max_position_embeddings` automatically adjusts the sparse grid without any
architectural changes.
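Concretely, the grid size is a pure config computation; a minimal sketch (the function name is ours, not the model's source):

```python
# Minimal sketch: the block-sparse attention grid is derived from config
# values alone, so raising max_position_embeddings enlarges the grid
# without touching the model code.
def sparse_grid_blocks(max_position_embeddings: int, sparse_block_size: int = 16) -> int:
    """Number of blocks per axis of the sparse attention mask grid."""
    return max_position_embeddings // sparse_block_size

print(sparse_grid_blocks(2048))  # 128 blocks at the original context
print(sparse_grid_blocks(8192))  # 512 blocks after extension
```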
### Strategy
Context was extended in two steps: 2k -> 4k, then 4k -> 8k, continuing from the previous checkpoint each time.
**Step 1 - Positional embedding tiling.** The existing embedding matrix is kept intact for known positions (0 to N-1). New positions are filled by cycling through the original table:
```
position 2048 <- weights of position 0
position 2049 <- weights of position 1
...
position 4095 <- weights of position 2047
position 4096 <- weights of position 0 (second cycle)
...
position 8191 <- weights of position 4095
```
This is deliberately chosen over linear interpolation: interpolation perturbs all existing embeddings and causes severe perplexity regression on short contexts. Tiling preserves exact weights for positions 0..N-1, so the model does not "forget" how to handle short sequences.
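The tiling step can be sketched as follows (a simplified reconstruction, not the author's exact script; assumes PyTorch):

```python
import torch
from torch import nn

def tile_position_embeddings(old: nn.Embedding, new_max_positions: int) -> nn.Embedding:
    """Extend a learned absolute position table by cycling the original rows.

    Positions 0..N-1 keep their exact weights; position N reuses row 0,
    position N+1 reuses row 1, and so on, so short-context behavior is
    preserved bit-for-bit.
    """
    old_max, hidden = old.weight.shape
    new = nn.Embedding(new_max_positions, hidden)
    with torch.no_grad():
        idx = torch.arange(new_max_positions) % old_max  # cyclic index map
        new.weight.copy_(old.weight[idx])
    return new
```

Each extension step (2k -> 4k, then 4k -> 8k) applies this to the *current* table, which yields exactly the mapping listed above.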
**Step 2 - Mixed-length dataset.** Training uses a 60/40 mix of long and short examples:
- Long (60%): multiple news articles from IlyaGusev/gazeta packed together with EOS tokens until reaching the target context length. All packed samples exceed half the target length, ensuring the model is consistently exposed to new position indices.
- Short (40%): single-article chunks up to half the target length. Prevents forgetting short-context behavior.
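A rough sketch of this packing scheme (function names, the EOS placeholder, and details are ours, not the release code):

```python
EOS_ID = 0  # placeholder; the real tokenizer defines its own EOS id

def pack_long(tokenized_articles, target_len, eos_id=EOS_ID):
    """Concatenate tokenized articles with EOS separators until target_len,
    then truncate. Packs shorter than half the target are discarded so every
    long sample exercises the new position indices."""
    packed = []
    for tokens in tokenized_articles:
        packed.extend(tokens)
        packed.append(eos_id)
        if len(packed) >= target_len:
            break
    packed = packed[:target_len]
    return packed if len(packed) > target_len // 2 else None

def chunk_short(tokens, target_len):
    """Single-article chunk capped at half the target context length."""
    return tokens[: target_len // 2]
```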
**Step 3 - Continued pretraining.** 3 epochs, `lr=5e-6`, cosine decay, `warmup_steps=50`, `gradient_checkpointing=True`, bfloat16, `gradient_accumulation_steps=8`; hardware: RTX 4090 (48 GB VRAM).
**Note on OOM.** Training at 8k context caused CUDA memory fragmentation during backpropagation, crashing at step 517/936 despite ~1 GB of technically free VRAM. The fix was `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`; after applying it, peak usage dropped from 46.8 GB to 38.5 GB and training completed without issues.
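One way to apply the fix from Python (a general recipe, not the author's script; the setting must be in the environment before the first CUDA allocation):

```python
import os

# Must run before torch initializes CUDA, i.e. before the first tensor
# lands on the GPU (ideally before `import torch`).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
```

Equivalently, export the variable in the shell before launching the training script.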
## Perplexity

Evaluated on the test split of IlyaGusev/gazeta, strategy `non_overlapping`, bfloat16.
| Model | PPL @ 2048 | PPL @ 4096 | PPL @ 8192 |
|---|---|---|---|
| ruGPT3XL (baseline) | 11.68 | - | - |
| ruGPT3XL-4k (intermediate) | 11.75 | 12.04 | - |
| ruGPT3XL-8k (this model) | 11.77 | 11.99 | 13.00 |
Regression on the original 2k context is +0.09 PPL - essentially unchanged. The 4k evaluation on the 8k model is slightly better than the intermediate 4k checkpoint (11.99 vs 12.04), indicating that continued pretraining improved overall quality.
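For reference, the `non_overlapping` strategy scores each token exactly once in disjoint windows, and perplexity is the exponential of the mean per-token negative log-likelihood. A sketch of the bookkeeping (not the exact eval harness):

```python
import math

def non_overlapping_chunks(token_ids, window):
    """Split a token stream into disjoint windows of at most `window` tokens,
    so every token is scored exactly once (no sliding-window overlap)."""
    return [token_ids[i : i + window] for i in range(0, len(token_ids), window)]

def perplexity(total_nll: float, total_tokens: int) -> float:
    """PPL = exp(summed negative log-likelihood / number of scored tokens)."""
    return math.exp(total_nll / total_tokens)
```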
## VRAM Requirements (inference, batch=1, bfloat16)
| Context length | VRAM peak | KV + activations |
|---|---|---|
| 512 | 2.92 GiB | 0.25 GiB |
| 1024 | 3.16 GiB | 0.49 GiB |
| 2048 | 3.86 GiB | 1.19 GiB |
| 4096 | 6.57 GiB | 3.90 GiB |
| 8192 | 15.98 GiB | 13.31 GiB |
Model weights occupy ~2.67 GiB (bfloat16). Overhead from KV cache and activations grows roughly linearly up to ~2k (sparse attention helps) and becomes near-quadratic beyond that. GPUs with 8 GB VRAM are practical up to ~3.5-4k context.
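A back-of-the-envelope estimate of the dense-attention KV-cache portion (our approximation; it ignores activations and the cheaper sparse layers, which is why the measured overhead above is larger):

```python
def kv_cache_gib(seq_len: int, n_layers: int = 24, hidden: int = 2048,
                 bytes_per_elem: int = 2) -> float:
    """K and V tensors of shape [seq_len, hidden] per layer, in bfloat16
    (2 bytes per element)."""
    return 2 * n_layers * seq_len * hidden * bytes_per_elem / 2**30

print(kv_cache_gib(8192))  # 1.5 GiB; the rest of the 13.31 GiB is activations
```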
## Generation Speed (bfloat16, 64 new tokens, batch=1, RTX 4090)
| Prompt length | tok/s | ms / token |
|---|---|---|
| 512 | 1444 | 0.7 |
| 1024 | 882 | 1.1 |
| 2048 | 378 | 2.6 |
| 4096 | 67 | 14.9 |
| 8000 | 38 | 26.6 |
Speed is measured for autoregressive decoding with KV cache. The 2x step from 4k to 8k prompt length causes only ~1.8x slowdown (67 -> 38 tok/s), consistent with the linear scaling expected from sparse attention.
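Throughput of this kind can be measured with a simple wall-clock harness (a sketch; `generate_fn` stands in for a zero-argument call to `model.generate`):

```python
import time

def decode_tokens_per_second(generate_fn, n_new_tokens: int) -> float:
    """Wall-clock decoding throughput for a callable that emits
    n_new_tokens tokens (e.g. a lambda wrapping model.generate)."""
    start = time.perf_counter()
    generate_fn()
    return n_new_tokens / (time.perf_counter() - start)
```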
## Sparse Attention
Inherited from the base model: even-numbered layers (0, 2, 4, ...) use block-sparse causal
attention, odd-numbered layers use standard dense causal attention. The sparse pattern is
computed from `config.json` at model init and does not require DeepSpeed at inference time.
| Parameter | Value |
|---|---|
| `sparse_mode` | `"alternating"` |
| `sparse_block_size` | 16 |
| `sparse_num_local_blocks` | 8 (local window = 128 tokens) |
| `sparse_num_global_blocks` | 1 |
| `sparse_num_different_global_patterns` | 8 |
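The block pattern can be reconstructed as a boolean mask (our approximation of a DeepSpeed-style local + global layout, not the exact kernel; defaults mirror the config above):

```python
import torch

def block_sparse_causal_mask(seq_len: int, block_size: int = 16,
                             num_local_blocks: int = 8,
                             num_global_blocks: int = 1) -> torch.Tensor:
    """Each query block attends causally to itself, the preceding
    num_local_blocks - 1 blocks (a 128-token window with the defaults),
    and the first num_global_blocks block(s)."""
    n = seq_len // block_size
    blocks = torch.zeros(n, n, dtype=torch.bool)
    for q in range(n):
        lo = max(0, q - num_local_blocks + 1)
        blocks[q, lo : q + 1] = True          # local causal window
        blocks[q, :num_global_blocks] = True  # global blocks
    # expand the block grid to token level and apply the causal triangle
    mask = blocks.repeat_interleave(block_size, 0).repeat_interleave(block_size, 1)
    return mask & torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
```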
## Limitations
- Base model, not instruction-tuned. Works best for text completion.
- Primarily Russian text. Limited capability in other languages.
- Content may be biased, factually incorrect, or offensive - inherited from the original pretraining corpus.
- At 8k context, inference requires ~16 GB VRAM (bfloat16, batch=1).
## Training Details
| Parameter | 2k -> 4k step | 4k -> 8k step |
|---|---|---|
| Base | evilfreelancer/ruGPT3XL | ruGPT3XL-4k (intermediate) |
| Dataset | IlyaGusev/gazeta | IlyaGusev/gazeta |
| Train samples | 2500 (1500 long + 1000 short) | 2500 (1500 long + 1000 short) |
| Val samples | 250 | 250 |
| Packed length | 4096 | 8192 |
| Short max length | 2048 | 4096 |
| Epochs | 3 | 3 |
| Learning rate | 5e-6 | 5e-6 |
| LR scheduler | cosine | cosine |
| Warmup steps | 50 | 50 |
| Batch size (effective) | 8 | 8 |
| Optimizer | AdamW fused | AdamW fused |
| Precision | bfloat16 | bfloat16 |
| Hardware | RTX 4090 48 GB | RTX 4090 48 GB |
| Training time | ~2.6 h | ~3.9 h (incl. resume) |
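The hyperparameters in the table map onto Hugging Face `TrainingArguments` roughly as follows (our mapping, not the released training script; effective batch 8 = 1 per device × 8 accumulation steps):

```python
from transformers import TrainingArguments

# Assumed mapping of the table above onto TrainingArguments; output_dir
# and per_device_train_batch_size are our choices, not stated in the card.
args = TrainingArguments(
    output_dir="rugpt3xl-ctx-ext",
    num_train_epochs=3,
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
    warmup_steps=50,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    bf16=True,
    gradient_checkpointing=True,
)
```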
## Citation

```bibtex
@misc{rugpt3xl_8k,
  title={ruGPT-3 XL 8k - extended context window via positional embedding tiling},
  author={Pavel Rykov},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/evilfreelancer/ruGPT3XL-8k}
}
```
## Links
- evilfreelancer/ruGPT3XL - base model
- ai-forever/rugpt3xl - original Megatron-LM checkpoint
- IlyaGusev/gazeta - fine-tuning dataset
- Extending Input Contexts via Segmented Sequences (arXiv:2310.14633)
- Impact of Positional Encoding on Length Generalization (arXiv:2305.19466)