---
language:
- ru
library_name: transformers
tags:
- text-generation
- gpt3
- russian
- causal-lm
- context-extension
license: mit
pipeline_tag: text-generation
base_model: evilfreelancer/ruGPT3XL
datasets:
- IlyaGusev/gazeta
---

# ruGPT-3 XL 8k

A 1.3B-parameter GPT-2-style language model for Russian with an extended context window of **8192 tokens**, trained via continued pretraining from [evilfreelancer/ruGPT3XL](https://huggingface.co/evilfreelancer/ruGPT3XL).

This is a **base (pretrained) model**, not instruction-tuned.

## Model Details

| Parameter | Value |
|---|---|
| Parameters | 1.3B |
| Architecture | GPT-2 (decoder-only transformer) |
| Hidden size | 2048 |
| Layers | 24 |
| Attention heads | 16 |
| FFN intermediate size | 8192 |
| Max sequence length | **8192** |
| Vocabulary | 50,264 tokens (BPE) |
| Activation | GELU |
| Normalization | Pre-LayerNorm |
| Position encoding | Learned absolute (tiled extension) |
| Attention | Alternating sparse/dense |
| Precision | bfloat16 |
| Base model | evilfreelancer/ruGPT3XL (2048 ctx) |
| Fine-tuning dataset | IlyaGusev/gazeta |

## Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "evilfreelancer/ruGPT3XL-8k"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype="bfloat16",
    device_map="auto",
)

inputs = tokenizer("Москва - столица", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.2,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Context Extension: 2k -> 4k -> 8k

The original ruGPT3XL uses **learned absolute positional embeddings (APE)**: the position table `embed_positions` is a plain `nn.Embedding(max_position_embeddings, hidden_size)` trained together with all other weights. The model has therefore never seen position indices beyond 2047 and cannot generalize to longer sequences without fine-tuning.

Additionally, the model uses **alternating sparse attention** where the attention mask grid is built dynamically as `num_blocks = max_position_embeddings // sparse_block_size`, so increasing `max_position_embeddings` automatically adjusts the sparse grid without any architectural changes.

### Strategy

Context was extended in two steps: 2k -> 4k, then 4k -> 8k, continuing from the previous checkpoint each time.

**Step 1 - Positional embedding tiling.**
The existing embedding matrix is kept intact for known positions (0 to N-1). New positions are filled by cycling through the original table:

```
position 2048 <- weights of position 0
position 2049 <- weights of position 1
...
position 4095 <- weights of position 2047

position 4096 <- weights of position 0 (second cycle)
...
position 8191 <- weights of position 4095
```

This is deliberately chosen over linear interpolation: interpolation perturbs all existing embeddings and causes severe perplexity regression on short contexts. Tiling preserves exact weights for positions 0..N-1, so the model does not "forget" how to handle short sequences.
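
The tiling rule amounts to indexing the existing table modulo its length. A minimal NumPy sketch (the array and function names here are illustrative, not the model's actual parameter names):

```python
import numpy as np

def tile_position_embeddings(old_table: np.ndarray, new_len: int) -> np.ndarray:
    """Extend a learned position-embedding table by cycling its rows.

    Rows 0..old_len-1 are copied verbatim, so short-context behavior
    is untouched; rows beyond that repeat the original table.
    """
    old_len = old_table.shape[0]
    return old_table[np.arange(new_len) % old_len]

# Toy example: a 4-row table (positions x hidden) extended to 8 rows.
old = np.arange(8, dtype=np.float32).reshape(4, 2)
new = tile_position_embeddings(old, 8)
assert (new[:4] == old).all()  # original positions preserved exactly
assert (new[4:] == old).all()  # new positions reuse the old weights
```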

**Step 2 - Mixed-length dataset.**
Training uses a 60/40 mix of long and short examples:

- **Long (60%):** multiple news articles from `IlyaGusev/gazeta` packed together with EOS tokens until reaching the target context length. All packed samples exceed half the target length, ensuring the model is consistently exposed to new position indices.
- **Short (40%):** single-article chunks up to half the target length. Prevents forgetting short-context behavior.
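
The packing step for the long examples can be sketched as follows (the helper below is illustrative, not the actual training script; for simplicity it discards any incomplete tail buffer):

```python
def pack_articles(token_lists, eos_id, target_len):
    """Concatenate tokenized articles, separated by EOS tokens,
    into training samples of exactly target_len tokens each."""
    samples, buffer = [], []
    for tokens in token_lists:
        buffer.extend(tokens)
        buffer.append(eos_id)
        while len(buffer) >= target_len:
            samples.append(buffer[:target_len])
            buffer = buffer[target_len:]
    return samples  # any leftover tail in `buffer` is dropped here

# Toy example: three short "articles" packed into length-8 samples.
articles = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]
assert pack_articles(articles, eos_id=0, target_len=8) == [[1, 2, 3, 0, 4, 5, 0, 6]]
```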

**Step 3 - Continued pretraining.**
3 epochs, `lr=5e-6`, cosine decay, `warmup_steps=50`, `gradient_checkpointing=True`, `bfloat16`, `gradient_accumulation_steps=8`; hardware: RTX 4090 (48 GB VRAM).

> **Note on OOM.** Training at 8k context caused CUDA memory fragmentation during backpropagation, crashing at step 517/936 despite ~1 GB of technically free VRAM. Fix: `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`. After this, peak usage dropped from 46.8 GB to 38.5 GB and training completed without issues.
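
If reproducing this from a Python entry point rather than the shell, the allocator setting must be in the environment before CUDA is first initialized, e.g.:

```python
import os

# Must be set before the CUDA context is created (i.e. before the
# first torch.cuda call), otherwise the allocator ignores it.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
```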

### Perplexity

Evaluated on the `test` split of `IlyaGusev/gazeta`, strategy `non_overlapping`, `bfloat16`.



| Model | PPL @ 2048 | PPL @ 4096 | PPL @ 8192 |
|---|---|---|---|
| ruGPT3XL (baseline) | 11.68 | - | - |
| ruGPT3XL-4k (intermediate) | 11.75 | 12.04 | - |
| **ruGPT3XL-8k (this model)** | **11.77** | **11.99** | **13.00** |

Regression on the original 2k context is +0.09 PPL - essentially unchanged. The 4k evaluation on the 8k model is slightly better than the intermediate 4k checkpoint (11.99 vs 12.04), indicating that continued pretraining improved overall quality.
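
Perplexity here is the exponential of the mean per-token negative log-likelihood over the non-overlapping evaluation windows; as a quick reference:

```python
import math

def perplexity(nll_per_token):
    """PPL = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(nll_per_token) / len(nll_per_token))

# Sanity check: a uniform NLL of ln(11.77) yields PPL 11.77 exactly.
assert abs(perplexity([math.log(11.77)] * 100) - 11.77) < 1e-9
```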

### VRAM Requirements (inference, batch=1, bfloat16)



| Context length | VRAM peak | KV + activations |
|---|---|---|
| 512 | 2.92 GiB | 0.25 GiB |
| 1024 | 3.16 GiB | 0.49 GiB |
| 2048 | 3.86 GiB | 1.19 GiB |
| 4096 | 6.57 GiB | 3.90 GiB |
| 8192 | 15.98 GiB | 13.31 GiB |

Model weights occupy ~2.67 GiB (bfloat16). Overhead from KV cache and activations grows roughly linearly up to ~2k (sparse attention helps) and becomes near-quadratic beyond that. GPUs with 8 GB VRAM are practical up to ~3.5-4k context.
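
As a back-of-envelope check (assuming dense K/V caching in every layer), the KV cache itself is only a small share of that overhead; most of the rest is prefill activations:

```python
# KV cache size: 2 tensors (K and V) per layer, one hidden-size
# vector per cached token, 2 bytes per value in bfloat16.
layers, hidden, bytes_per_value = 24, 2048, 2

def kv_cache_gib(seq_len: int) -> float:
    return 2 * layers * seq_len * hidden * bytes_per_value / 2**30

assert kv_cache_gib(8192) == 1.5  # ~1.5 GiB of the 13.31 GiB at 8k
```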

### Generation Speed (bfloat16, 64 new tokens, batch=1, RTX 4090)



| Prompt length | tok/s | ms / token |
|---|---|---|
| 512 | 1444 | 0.7 |
| 1024 | 882 | 1.1 |
| 2048 | 378 | 2.6 |
| 4096 | 67 | 14.9 |
| 8000 | 38 | 26.6 |

Speed is measured for autoregressive decoding with KV cache. The 2x step from 4k to 8k prompt length causes only a ~1.8x slowdown (67 -> 38 tok/s), consistent with the linear scaling expected from sparse attention.

## Sparse Attention

Inherited from the base model: even-numbered layers (0, 2, 4, ...) use block-sparse causal attention, odd-numbered layers use standard dense causal attention. The sparse pattern is computed from `config.json` at model init and does not require DeepSpeed at inference time.

| Parameter | Value |
|---|---|
| `sparse_mode` | `"alternating"` |
| `sparse_block_size` | `16` |
| `sparse_num_local_blocks` | `8` (local window = 128 tokens) |
| `sparse_num_global_blocks` | `1` |
| `sparse_num_different_global_patterns` | `8` |
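
A simplified sketch of such a local-plus-global block mask (this shows the general shape only; the model's actual pattern, with its `sparse_num_different_global_patterns` rotation, is more involved):

```python
import numpy as np

def block_sparse_mask(seq_len, block=16, local_blocks=8, global_blocks=1):
    """Block-level causal mask: each query block attends to itself, the
    previous local_blocks - 1 blocks, and the first global_blocks blocks."""
    n = seq_len // block
    mask = np.zeros((n, n), dtype=bool)
    for q in range(n):
        mask[q, max(0, q - local_blocks + 1):q + 1] = True  # local window
        mask[q, :min(global_blocks, q + 1)] = True          # global blocks
    return mask

m = block_sparse_mask(256)         # 16 blocks of 16 tokens each
assert m[15, 15] and m[15, 8]      # inside the 8-block local window
assert m[15, 0] and not m[15, 7]   # global block yes, gap no
assert not m[0, 1]                 # causal: no future blocks
```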

## Limitations

- Base model, not instruction-tuned. Works best for text completion.
- Primarily Russian text. Limited capability in other languages.
- Content may be biased, factually incorrect, or offensive - inherited from the original pretraining corpus.
- At 8k context, inference requires ~16 GB VRAM (bfloat16, batch=1).

## Training Details

| Parameter | 2k -> 4k step | 4k -> 8k step |
|---|---|---|
| Base | evilfreelancer/ruGPT3XL | ruGPT3XL-4k (intermediate) |
| Dataset | IlyaGusev/gazeta | IlyaGusev/gazeta |
| Train samples | 2500 (1500 long + 1000 short) | 2500 (1500 long + 1000 short) |
| Val samples | 250 | 250 |
| Packed length | 4096 | 8192 |
| Short max length | 2048 | 4096 |
| Epochs | 3 | 3 |
| Learning rate | 5e-6 | 5e-6 |
| LR scheduler | cosine | cosine |
| Warmup steps | 50 | 50 |
| Batch size (effective) | 8 | 8 |
| Optimizer | AdamW fused | AdamW fused |
| Precision | bfloat16 | bfloat16 |
| Hardware | RTX 4090 48 GB | RTX 4090 48 GB |
| Training time | ~2.6 h | ~3.9 h (incl. resume) |

## Citation

```bibtex
@misc{rugpt3xl_8k,
  title={ruGPT-3 XL 8k - extended context window via positional embedding tiling},
  author={Pavel Rykov},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/evilfreelancer/ruGPT3XL-8k}
}
```

## Links

- [evilfreelancer/ruGPT3XL](https://huggingface.co/evilfreelancer/ruGPT3XL) - base model
- [ai-forever/rugpt3xl](https://huggingface.co/ai-forever/rugpt3xl) - original Megatron-LM checkpoint
- [IlyaGusev/gazeta](https://huggingface.co/datasets/IlyaGusev/gazeta) - fine-tuning dataset
- [Extending Input Contexts via Segmented Sequences](https://arxiv.org/abs/2310.14633) (arXiv:2310.14633)
- [Impact of Positional Encoding on Length Generalization](https://arxiv.org/abs/2305.19466) (arXiv:2305.19466)