---
language:
- ru
library_name: transformers
tags:
- text-generation
- gpt3
- russian
- causal-lm
- context-extension
license: mit
pipeline_tag: text-generation
base_model: evilfreelancer/ruGPT3XL
datasets:
- IlyaGusev/gazeta
---

# ruGPT-3 XL 8k

A 1.3B-parameter GPT-2-style language model for Russian with an extended context window of **8192 tokens**, trained via continued pretraining from [evilfreelancer/ruGPT3XL](https://huggingface.co/evilfreelancer/ruGPT3XL).

This is a **base (pretrained) model**; it is not instruction-tuned.

## Model Details

| Parameter | Value |
|---|---|
| Parameters | 1.3B |
| Architecture | GPT-2 (decoder-only transformer) |
| Hidden size | 2048 |
| Layers | 24 |
| Attention heads | 16 |
| FFN intermediate size | 8192 |
| Max sequence length | **8192** |
| Vocabulary | 50,264 tokens (BPE) |
| Activation | GELU |
| Normalization | Pre-LayerNorm |
| Position encoding | Learned absolute (tiled extension) |
| Attention | Alternating sparse/dense |
| Precision | bfloat16 |
| Base model | evilfreelancer/ruGPT3XL (2048-token context) |
| Fine-tuning dataset | IlyaGusev/gazeta |

## Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "evilfreelancer/ruGPT3XL-8k"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype="bfloat16",
    device_map="auto",
)

inputs = tokenizer("Москва - столица", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.2,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Context Extension: 2k -> 4k -> 8k

The original ruGPT3XL uses **learned absolute positional embeddings (APE)**: the position table `embed_positions` is a plain `nn.Embedding(max_position_embeddings, hidden_size)` trained together with all other weights. The model has therefore never seen position indices beyond 2047 and cannot generalize to longer sequences without fine-tuning.

The model also uses **alternating sparse attention**, whose mask grid is built dynamically as `num_blocks = max_position_embeddings // sparse_block_size`; increasing `max_position_embeddings` therefore adjusts the sparse grid automatically, with no architectural changes.
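
To make the two facts above concrete, here is a small sketch (the hidden size is shrunk for readability; `embed_positions` mirrors the name used above, everything else is illustrative):

```python
import torch
import torch.nn as nn

# Learned APE is a fixed-size lookup table; indices beyond it do not exist.
max_position_embeddings, hidden_size = 2048, 64
embed_positions = nn.Embedding(max_position_embeddings, hidden_size)

print(embed_positions(torch.arange(2048)).shape)  # all trained positions: fine

try:
    embed_positions(torch.tensor([2048]))  # first unseen position index
except IndexError as err:
    print("position 2048 fails:", err)

# The sparse-attention grid, by contrast, follows the config automatically:
sparse_block_size = 16
num_blocks = max_position_embeddings // sparse_block_size
print("num_blocks =", num_blocks)  # 128 at 2k context; doubles with the limit
```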

### Strategy

Context was extended in two steps - 2k -> 4k, then 4k -> 8k - continuing from the previous checkpoint each time.

**Step 1 - Positional embedding tiling.**
The existing embedding matrix is kept intact for known positions (0 to N-1). New positions are filled by cycling through the previous table:

```
position 2048 <- weights of position 0
position 2049 <- weights of position 1
...
position 4095 <- weights of position 2047

position 4096 <- weights of position 0 (second cycle)
...
position 8191 <- weights of position 4095
```

This approach was deliberately chosen over linear interpolation: interpolation perturbs all existing embeddings and causes severe perplexity regression on short contexts. Tiling preserves the exact weights for positions 0..N-1, so the model does not "forget" how to handle short sequences.
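
The tiling step can be sketched as follows (a minimal illustration on a stand-in embedding; the attribute names in the actual codebase may differ):

```python
import torch
import torch.nn as nn

def tile_positions(old_embed: nn.Embedding, new_max: int) -> nn.Embedding:
    """Extend a learned position table by cycling its existing rows:
    rows 0..N-1 are copied verbatim, row i >= N receives row i % N."""
    old_max, hidden = old_embed.weight.shape
    new_embed = nn.Embedding(new_max, hidden)
    with torch.no_grad():
        idx = torch.arange(new_max) % old_max  # 0, 1, ..., N-1, 0, 1, ...
        new_embed.weight.copy_(old_embed.weight[idx])
    return new_embed

old = nn.Embedding(2048, 64)   # stand-in for the 2k position table
new = tile_positions(old, 4096)

print(torch.equal(new.weight[:2048], old.weight))    # True: short contexts intact
print(torch.equal(new.weight[2048], old.weight[0]))  # True: position 2048 <- 0
```

Applying the same operation again to the 4096-row table produces the 8k table, which is why position 8191 inherits the weights of position 4095 rather than 2047.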

**Step 2 - Mixed-length dataset.**
Training uses a 60/40 mix of long and short examples:

- **Long (60%):** multiple news articles from `IlyaGusev/gazeta` packed together with EOS tokens until the target context length is reached. Every packed sample exceeds half the target length, so the model is consistently exposed to the new position indices.
- **Short (40%):** single-article chunks of up to half the target length, which prevents forgetting of short-context behavior.
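
The long-sample packing recipe can be sketched like this (a hypothetical helper; the actual preprocessing code is not published here, and token IDs are stand-ins):

```python
EOS = 0  # stand-in for the tokenizer's EOS id

def pack_articles(articles, target_len):
    """Greedily concatenate tokenized articles, separated by EOS,
    into a single sample of at most target_len tokens."""
    packed = []
    for tokens in articles:
        if len(packed) + len(tokens) + 1 > target_len:
            break
        packed.extend(tokens)
        packed.append(EOS)  # article boundary
    return packed

articles = [[1] * 3000, [2] * 3500, [3] * 2000]
sample = pack_articles(articles, target_len=8192)
print(len(sample))  # 6502 - more than half the 8192 target, as required
```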

**Step 3 - Continued pretraining.**
3 epochs, `lr=5e-6`, cosine decay, `warmup_steps=50`, `gradient_checkpointing=True`, `bfloat16`, `gradient_accumulation_steps=8`; hardware: RTX 4090 (48 GB VRAM).

> **Note on OOM.** Training at 8k context caused CUDA memory fragmentation during backpropagation, crashing at step 517/936 despite ~1 GB of technically free VRAM. The fix was `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`: peak usage dropped from 46.8 GB to 38.5 GB and training completed without issues.
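
The allocator reads `PYTORCH_CUDA_ALLOC_CONF` once, when CUDA is first initialized, so the variable must be in place before PyTorch starts - either exported in the shell, or set at the very top of the training script:

```python
import os

# Must run before the first `import torch` in the process: the CUDA
# caching allocator reads this variable once, at initialization.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```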

### Perplexity

Evaluated on the `test` split of `IlyaGusev/gazeta` (strategy: `non_overlapping`, `bfloat16`).

![Perplexity chart](assets/ppl_chart.png)

| Model | PPL @ 2048 | PPL @ 4096 | PPL @ 8192 |
|---|---|---|---|
| ruGPT3XL (baseline) | 11.68 | - | - |
| ruGPT3XL-4k (intermediate) | 11.75 | 12.04 | - |
| **ruGPT3XL-8k (this model)** | **11.77** | **11.99** | **13.00** |

Regression on the original 2k context is +0.09 PPL - essentially unchanged. The 4k evaluation of the 8k model is slightly better than the intermediate 4k checkpoint (11.99 vs 12.04), indicating that continued pretraining improved overall quality.
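
The `non_overlapping` strategy scores each token exactly once, with context limited to its own window; a minimal sketch of the chunking and the perplexity formula (illustrative, not the actual evaluation harness):

```python
import math

def non_overlapping_chunks(token_ids, window):
    # Disjoint windows: each token's loss is computed once, with context
    # restricted to the window it falls in (no sliding overlap).
    return [token_ids[i:i + window] for i in range(0, len(token_ids), window)]

def perplexity(nll_per_token):
    # PPL = exp(mean negative log-likelihood over all scored tokens)
    return math.exp(sum(nll_per_token) / len(nll_per_token))

print(non_overlapping_chunks(list(range(10)), 4))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(round(perplexity([2.46] * 4), 2))  # 11.7 - the 2k-baseline ballpark
```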

### VRAM Requirements (inference, batch=1, bfloat16)

![VRAM chart](assets/vram_chart.png)

| Context length | VRAM peak | KV cache + activations |
|---|---|---|
| 512 | 2.92 GiB | 0.25 GiB |
| 1024 | 3.16 GiB | 0.49 GiB |
| 2048 | 3.86 GiB | 1.19 GiB |
| 4096 | 6.57 GiB | 3.90 GiB |
| 8192 | 15.98 GiB | 13.31 GiB |

Model weights occupy ~2.67 GiB (bfloat16). Overhead from the KV cache and activations grows roughly linearly up to ~2k (sparse attention helps) and becomes near-quadratic beyond that. GPUs with 8 GB of VRAM are practical up to ~3.5-4k context.
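
A back-of-envelope estimate of just the KV-cache portion, using the config values from the Model Details table:

```python
# KV cache: K and V, each seq_len x hidden per layer, bfloat16 = 2 bytes.
LAYERS, HIDDEN, BYTES_PER_ELEM = 24, 2048, 2

def kv_cache_gib(seq_len: int) -> float:
    return 2 * LAYERS * seq_len * HIDDEN * BYTES_PER_ELEM / 2**30

for seq in (2048, 4096, 8192):
    print(f"{seq}: {kv_cache_gib(seq):.2f} GiB")  # 0.38 / 0.75 / 1.50
```

At 8k the KV cache alone is only ~1.5 GiB, so by this arithmetic most of the 13.31 GiB overhead in the table comes from activations, including the attention score matrices of the dense layers, which scale with the square of sequence length.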

### Generation Speed (bfloat16, 64 new tokens, batch=1, RTX 4090)

![Speed chart](assets/speed_chart.png)

| Prompt length | tok/s | ms/token |
|---|---|---|
| 512 | 1444 | 0.7 |
| 1024 | 882 | 1.1 |
| 2048 | 378 | 2.6 |
| 4096 | 67 | 14.9 |
| 8000 | 38 | 26.6 |

Speed is measured for autoregressive decoding with the KV cache enabled. Doubling the prompt length from 4k to 8k causes only a ~1.8x slowdown (67 -> 38 tok/s), consistent with the roughly linear scaling expected from sparse attention.

## Sparse Attention

Inherited from the base model: even-numbered layers (0, 2, 4, ...) use block-sparse causal attention, while odd-numbered layers use standard dense causal attention. The sparse pattern is computed from `config.json` at model initialization and does not require DeepSpeed at inference time.

| Parameter | Value |
|---|---|
| `sparse_mode` | `"alternating"` |
| `sparse_block_size` | `16` |
| `sparse_num_local_blocks` | `8` (local window = 128 tokens) |
| `sparse_num_global_blocks` | `1` |
| `sparse_num_different_global_patterns` | `8` |
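
A toy sketch of the block-local part of such a mask (illustrative only: it omits the global blocks and uses small sizes; with the real config, `sparse_num_local_blocks=8` and `sparse_block_size=16` give each query a 128-token local window):

```python
def block_local_causal_mask(num_blocks, num_local_blocks):
    """mask[q][k] is True iff query block q may attend to key block k:
    causal (k <= q) and within the last num_local_blocks blocks."""
    return [
        [q - num_local_blocks < k <= q for k in range(num_blocks)]
        for q in range(num_blocks)
    ]

for row in block_local_causal_mask(num_blocks=6, num_local_blocks=3):
    print("".join("x" if allowed else "." for allowed in row))
# x.....
# xx....
# xxx...
# .xxx..
# ..xxx.
# ...xxx
```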

## Limitations

- Base model, not instruction-tuned; it works best for text completion.
- Primarily Russian text; capability in other languages is limited.
- Output may be biased, factually incorrect, or offensive - behavior inherited from the original pretraining corpus.
- At 8k context, inference requires ~16 GB of VRAM (bfloat16, batch=1).

## Training Details

| Parameter | 2k -> 4k step | 4k -> 8k step |
|---|---|---|
| Base | evilfreelancer/ruGPT3XL | ruGPT3XL-4k (intermediate) |
| Dataset | IlyaGusev/gazeta | IlyaGusev/gazeta |
| Train samples | 2500 (1500 long + 1000 short) | 2500 (1500 long + 1000 short) |
| Val samples | 250 | 250 |
| Packed length | 4096 | 8192 |
| Short max length | 2048 | 4096 |
| Epochs | 3 | 3 |
| Learning rate | 5e-6 | 5e-6 |
| LR scheduler | cosine | cosine |
| Warmup steps | 50 | 50 |
| Batch size (effective) | 8 | 8 |
| Optimizer | AdamW (fused) | AdamW (fused) |
| Precision | bfloat16 | bfloat16 |
| Hardware | RTX 4090 48 GB | RTX 4090 48 GB |
| Training time | ~2.6 h | ~3.9 h (incl. resume) |

## Citation

```bibtex
@misc{rugpt3xl_8k,
  title={ruGPT-3 XL 8k - extended context window via positional embedding tiling},
  author={Pavel Rykov},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/evilfreelancer/ruGPT3XL-8k}
}
```

## Links

- [evilfreelancer/ruGPT3XL](https://huggingface.co/evilfreelancer/ruGPT3XL) - base model
- [ai-forever/rugpt3xl](https://huggingface.co/ai-forever/rugpt3xl) - original Megatron-LM checkpoint
- [IlyaGusev/gazeta](https://huggingface.co/datasets/IlyaGusev/gazeta) - fine-tuning dataset
- [Extending Input Contexts via Segmented Sequences](https://arxiv.org/abs/2310.14633) (arXiv:2310.14633)
- [Impact of Positional Encoding on Length Generalization](https://arxiv.org/abs/2305.19466) (arXiv:2305.19466)