---
language:
- ru
library_name: transformers
tags:
- text-generation
- gpt3
- russian
- causal-lm
- context-extension
license: mit
pipeline_tag: text-generation
base_model: evilfreelancer/ruGPT3XL
datasets:
- IlyaGusev/gazeta
---

# ruGPT-3 XL 8k

A 1.3B-parameter GPT-2-style language model for Russian with an extended context window of **8192 tokens**, trained via continued pretraining from [evilfreelancer/ruGPT3XL](https://huggingface.co/evilfreelancer/ruGPT3XL).

This is a **base (pretrained) model**, not instruction-tuned.

## Model Details

| Parameter | Value |
|---|---|
| Parameters | 1.3B |
| Architecture | GPT-2 (decoder-only transformer) |
| Hidden size | 2048 |
| Layers | 24 |
| Attention heads | 16 |
| FFN intermediate size | 8192 |
| Max sequence length | **8192** |
| Vocabulary | 50,264 tokens (BPE) |
| Activation | GELU |
| Normalization | Pre-LayerNorm |
| Position encoding | Learned absolute (tiled extension) |
| Attention | Alternating sparse/dense |
| Precision | bfloat16 |
| Base model | evilfreelancer/ruGPT3XL (2048 ctx) |
| Fine-tuning dataset | IlyaGusev/gazeta |

## Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "evilfreelancer/ruGPT3XL-8k"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype="bfloat16",
    device_map="auto",
)

inputs = tokenizer("Москва - столица", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.2,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Context Extension: 2k -> 4k -> 8k

The original ruGPT3XL uses **learned absolute positional embeddings (APE)**: the position table `embed_positions` is a plain `nn.Embedding(max_position_embeddings, hidden_size)` trained together with all other weights. The model has therefore never seen position indices beyond 2047 and cannot generalize to longer sequences without fine-tuning.

Additionally, the model uses **alternating sparse attention** where the attention mask grid is built dynamically as `num_blocks = max_position_embeddings // sparse_block_size`, so increasing `max_position_embeddings` automatically adjusts the sparse grid without any architectural changes.

### Strategy

Context was extended in two steps: 2k -> 4k, then 4k -> 8k, continuing from the previous checkpoint each time.

**Step 1 - Positional embedding tiling.**
The existing embedding matrix is kept intact for known positions (0 to N-1). New positions are filled by cycling through the original table:

```
position 2048 <- weights of position 0
position 2049 <- weights of position 1
...
position 4095 <- weights of position 2047

position 4096 <- weights of position 0 (second cycle)
...
position 8191 <- weights of position 4095
```

This is deliberately chosen over linear interpolation: interpolation perturbs all existing embeddings and causes severe perplexity regression on short contexts. Tiling preserves exact weights for positions 0..N-1, so the model does not "forget" how to handle short sequences.
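
The tiling rule amounts to indexing the existing table modulo its length. A minimal NumPy sketch (the array and function names here are illustrative, not the model's actual parameter names):

```python
import numpy as np

def tile_position_embeddings(old_table: np.ndarray, new_len: int) -> np.ndarray:
    """Extend a learned position-embedding table by cycling its rows.

    Rows 0..old_len-1 are copied verbatim, so short-context behavior
    is untouched; rows beyond that repeat the original table.
    """
    old_len = old_table.shape[0]
    return old_table[np.arange(new_len) % old_len]

# Toy example: a 4-row table (positions x hidden) extended to 8 rows.
old = np.arange(8, dtype=np.float32).reshape(4, 2)
new = tile_position_embeddings(old, 8)
assert (new[:4] == old).all()  # original positions preserved exactly
assert (new[4:] == old).all()  # new positions reuse the old weights
```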

**Step 2 - Mixed-length dataset.**
Training uses a 60/40 mix of long and short examples:

- **Long (60%):** multiple news articles from `IlyaGusev/gazeta` packed together with EOS tokens until reaching the target context length. All packed samples exceed half the target length, ensuring the model is consistently exposed to new position indices.
- **Short (40%):** single-article chunks up to half the target length. Prevents forgetting short-context behavior.
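
The packing step for the long examples can be sketched as follows (the helper below is illustrative, not the actual training script; for simplicity it discards any incomplete tail buffer):

```python
def pack_articles(token_lists, eos_id, target_len):
    """Concatenate tokenized articles, separated by EOS tokens,
    into training samples of exactly target_len tokens each."""
    samples, buffer = [], []
    for tokens in token_lists:
        buffer.extend(tokens)
        buffer.append(eos_id)
        while len(buffer) >= target_len:
            samples.append(buffer[:target_len])
            buffer = buffer[target_len:]
    return samples  # any leftover tail in `buffer` is dropped here

# Toy example: three short "articles" packed into length-8 samples.
articles = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]
assert pack_articles(articles, eos_id=0, target_len=8) == [[1, 2, 3, 0, 4, 5, 0, 6]]
```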

**Step 3 - Continued pretraining.**
3 epochs, `lr=5e-6`, cosine decay, `warmup_steps=50`, `gradient_checkpointing=True`, `bfloat16`, `gradient_accumulation_steps=8`; hardware: RTX 4090 (48 GB VRAM).

> **Note on OOM.** Training at 8k context caused CUDA memory fragmentation during backpropagation, crashing at step 517/936 despite ~1 GB of technically free VRAM. Fix: `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`. After this, peak usage dropped from 46.8 GB to 38.5 GB and training completed without issues.
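
If reproducing this from a Python entry point rather than the shell, the allocator setting must be in the environment before CUDA is first initialized, e.g.:

```python
import os

# Must be set before the CUDA context is created (i.e. before the
# first torch.cuda call), otherwise the allocator ignores it.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
```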

### Perplexity

Evaluated on the `test` split of `IlyaGusev/gazeta`, strategy `non_overlapping`, `bfloat16`.



| Model | PPL @ 2048 | PPL @ 4096 | PPL @ 8192 |
|---|---|---|---|
| ruGPT3XL (baseline) | 11.68 | - | - |
| ruGPT3XL-4k (intermediate) | 11.75 | 12.04 | - |
| **ruGPT3XL-8k (this model)** | **11.77** | **11.99** | **13.00** |

Regression on the original 2k context is +0.09 PPL - essentially unchanged. The 4k evaluation on the 8k model is slightly better than the intermediate 4k checkpoint (11.99 vs 12.04), indicating that continued pretraining improved overall quality.
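
Perplexity here is the exponential of the mean per-token negative log-likelihood over the non-overlapping evaluation windows; as a quick reference:

```python
import math

def perplexity(nll_per_token):
    """PPL = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(nll_per_token) / len(nll_per_token))

# Sanity check: a uniform NLL of ln(11.77) yields PPL 11.77 exactly.
assert abs(perplexity([math.log(11.77)] * 100) - 11.77) < 1e-9
```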

### VRAM Requirements (inference, batch=1, bfloat16)



| Context length | VRAM peak | KV + activations |
|---|---|---|
| 512 | 2.92 GiB | 0.25 GiB |
| 1024 | 3.16 GiB | 0.49 GiB |
| 2048 | 3.86 GiB | 1.19 GiB |
| 4096 | 6.57 GiB | 3.90 GiB |
| 8192 | 15.98 GiB | 13.31 GiB |

Model weights occupy ~2.67 GiB (bfloat16). Overhead from KV cache and activations grows roughly linearly up to ~2k (sparse attention helps) and becomes near-quadratic beyond that. GPUs with 8 GB VRAM are practical up to ~3.5-4k context.
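
As a back-of-envelope check (assuming dense K/V caching in every layer), the KV cache itself is only a small share of that overhead; most of the rest is prefill activations:

```python
# KV cache size: 2 tensors (K and V) per layer, one hidden-size
# vector per cached token, 2 bytes per value in bfloat16.
layers, hidden, bytes_per_value = 24, 2048, 2

def kv_cache_gib(seq_len: int) -> float:
    return 2 * layers * seq_len * hidden * bytes_per_value / 2**30

assert kv_cache_gib(8192) == 1.5  # ~1.5 GiB of the 13.31 GiB at 8k
```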

### Generation Speed (bfloat16, 64 new tokens, batch=1, RTX 4090)



| Prompt length | tok/s | ms / token |
|---|---|---|
| 512 | 1444 | 0.7 |
| 1024 | 882 | 1.1 |
| 2048 | 378 | 2.6 |
| 4096 | 67 | 14.9 |
| 8000 | 38 | 26.6 |

Speed is measured for autoregressive decoding with KV cache. The 2x step from 4k to 8k prompt length causes only a ~1.8x slowdown (67 -> 38 tok/s), consistent with the linear scaling expected from sparse attention.

## Sparse Attention

Inherited from the base model: even-numbered layers (0, 2, 4, ...) use block-sparse causal attention, odd-numbered layers use standard dense causal attention. The sparse pattern is computed from `config.json` at model init and does not require DeepSpeed at inference time.

| Parameter | Value |
|---|---|
| `sparse_mode` | `"alternating"` |
| `sparse_block_size` | `16` |
| `sparse_num_local_blocks` | `8` (local window = 128 tokens) |
| `sparse_num_global_blocks` | `1` |
| `sparse_num_different_global_patterns` | `8` |
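
A simplified sketch of such a local-plus-global block mask (this shows the general shape only; the model's actual pattern, with its `sparse_num_different_global_patterns` rotation, is more involved):

```python
import numpy as np

def block_sparse_mask(seq_len, block=16, local_blocks=8, global_blocks=1):
    """Block-level causal mask: each query block attends to itself, the
    previous local_blocks - 1 blocks, and the first global_blocks blocks."""
    n = seq_len // block
    mask = np.zeros((n, n), dtype=bool)
    for q in range(n):
        mask[q, max(0, q - local_blocks + 1):q + 1] = True  # local window
        mask[q, :min(global_blocks, q + 1)] = True          # global blocks
    return mask

m = block_sparse_mask(256)         # 16 blocks of 16 tokens each
assert m[15, 15] and m[15, 8]      # inside the 8-block local window
assert m[15, 0] and not m[15, 7]   # global block yes, gap no
assert not m[0, 1]                 # causal: no future blocks
```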

## Limitations

- Base model, not instruction-tuned. Works best for text completion.
- Primarily Russian text. Limited capability in other languages.
- Content may be biased, factually incorrect, or offensive - inherited from the original pretraining corpus.
- At 8k context, inference requires ~16 GB VRAM (bfloat16, batch=1).

## Training Details

| Parameter | 2k -> 4k step | 4k -> 8k step |
|---|---|---|
| Base | evilfreelancer/ruGPT3XL | ruGPT3XL-4k (intermediate) |
| Dataset | IlyaGusev/gazeta | IlyaGusev/gazeta |
| Train samples | 2500 (1500 long + 1000 short) | 2500 (1500 long + 1000 short) |
| Val samples | 250 | 250 |
| Packed length | 4096 | 8192 |
| Short max length | 2048 | 4096 |
| Epochs | 3 | 3 |
| Learning rate | 5e-6 | 5e-6 |
| LR scheduler | cosine | cosine |
| Warmup steps | 50 | 50 |
| Batch size (effective) | 8 | 8 |
| Optimizer | AdamW fused | AdamW fused |
| Precision | bfloat16 | bfloat16 |
| Hardware | RTX 4090 48 GB | RTX 4090 48 GB |
| Training time | ~2.6 h | ~3.9 h (incl. resume) |

## Citation

```bibtex
@misc{rugpt3xl_8k,
  title={ruGPT-3 XL 8k - extended context window via positional embedding tiling},
  author={Pavel Rykov},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/evilfreelancer/ruGPT3XL-8k}
}
```

## Links

- [evilfreelancer/ruGPT3XL](https://huggingface.co/evilfreelancer/ruGPT3XL) - base model
- [ai-forever/rugpt3xl](https://huggingface.co/ai-forever/rugpt3xl) - original Megatron-LM checkpoint
- [IlyaGusev/gazeta](https://huggingface.co/datasets/IlyaGusev/gazeta) - fine-tuning dataset
- [Extending Input Contexts via Segmented Sequences](https://arxiv.org/abs/2310.14633) (arXiv:2310.14633)
- [Impact of Positional Encoding on Length Generalization](https://arxiv.org/abs/2305.19466) (arXiv:2305.19466)