---
language:
- ru
library_name: transformers
tags:
- text-generation
- gpt3
- russian
- causal-lm
license: mit
pipeline_tag: text-generation
base_model: ai-forever/rugpt3xl
---
# ruGPT-3 XL (HuggingFace format)
A 1.3B-parameter GPT-2-style language model for Russian, converted from the original
[ai-forever/rugpt3xl](https://huggingface.co/ai-forever/rugpt3xl) Megatron-LM checkpoint
into a native HuggingFace `transformers` format.
This is a **base (pretrained) model**, not instruction-tuned. It performs text completion
and can be fine-tuned for downstream tasks.
Details are in the paper "[A family of pretrained transformer language models for Russian](https://scholar.google.com/citations?view_op=view_citation&hl=en&user=yPayeJIAAAAJ&citation_for_view=yPayeJIAAAAJ:Se3iqnhoufwC)".
## Model Details
| Parameter | Value |
|---|---|
| Parameters | 1.3B |
| Architecture | GPT-2 (decoder-only transformer) |
| Hidden size | 2048 |
| Layers | 24 |
| Attention heads | 16 |
| FFN intermediate size | 8192 |
| Max sequence length | 2048 |
| Vocabulary | 50,264 tokens (BPE) |
| Activation | GELU |
| Normalization | Pre-LayerNorm |
| Position encoding | Learned absolute |
| Attention | Alternating sparse/dense (see [Sparse Attention](#sparse-attention)) |
| Precision | float16 |
| Training data | 80B tokens of Russian text (4 epochs) |
| Test perplexity | 12.05 |
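As a sanity check, the 1.3B figure can be reproduced from the other numbers in the table. This is a back-of-the-envelope sketch that ignores bias and LayerNorm terms and assumes the LM head is tied to the token embedding (an untied head would add roughly another 0.1B):

```python
# Rough parameter count from the architecture numbers above.
# Ignores biases and LayerNorms; assumes a tied LM head.
vocab, hidden, layers, ffn, max_pos = 50_264, 2048, 24, 8192, 2048

embed_tokens = vocab * hidden          # token embedding (shared with LM head)
embed_positions = max_pos * hidden     # learned absolute positions
attn = 4 * hidden * hidden             # q/k/v/o projections per layer
mlp = 2 * hidden * ffn                 # up/down projections per layer

total = embed_tokens + embed_positions + layers * (attn + mlp)
print(f"{total / 1e9:.2f}B parameters")  # ≈ 1.32B, matching the ~1.3B above
```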
## Quick Start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "evilfreelancer/ruGPT3XL"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
model_name, trust_remote_code=True, device_map="auto"
)
inputs = tokenizer("Москва - столица", return_tensors="pt").to(model.device)
outputs = model.generate(
**inputs,
max_new_tokens=100,
do_sample=True,
temperature=0.7,
top_p=0.9,
repetition_penalty=1.2,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Loading Options
**GPU (float16, recommended):**
```python
model = AutoModelForCausalLM.from_pretrained(
model_name, trust_remote_code=True, device_map="auto"
)
```
**CPU (float32):**
```python
import torch
model = AutoModelForCausalLM.from_pretrained(
model_name, trust_remote_code=True, dtype=torch.float32, device_map="cpu"
)
```
## Chat Template
The tokenizer includes a simple chat template for question-answering:
```python
messages = [
{"role": "user", "content": "Какая столица России?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Output: "Вопрос: Какая столица России?\n\nОтвет: "
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
> **Note:** This is a base model, not an instruction-tuned chatbot. The chat template provides
> a basic structure, but the model may not always follow instructions precisely. For reliable
> conversational behavior, fine-tune the model on instruction/chat data.
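If you prefer not to rely on `apply_chat_template`, the same prompt structure can be built by hand. This trivial sketch mirrors the template output shown above (`build_prompt` is a hypothetical helper, not part of the repository):

```python
def build_prompt(question: str) -> str:
    # Reproduces the "Вопрос/Ответ" layout of the bundled chat template.
    return f"Вопрос: {question}\n\nОтвет: "

prompt = build_prompt("Какая столица России?")
# "Вопрос: Какая столица России?\n\nОтвет: "
```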
## Fine-tuning
The model is fully compatible with standard HuggingFace training workflows.
### Full Fine-tuning with Trainer
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
model = AutoModelForCausalLM.from_pretrained(
model_name, trust_remote_code=True, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
args = TrainingArguments(
output_dir="./rugpt3xl-finetuned",
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=2e-5,
fp16=True,
save_strategy="epoch",
logging_steps=10,
)
trainer = Trainer(
model=model,
args=args,
train_dataset=your_dataset, # dataset with input_ids, attention_mask, labels
)
trainer.train()
```
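The `your_dataset` placeholder above must yield `input_ids`, `attention_mask`, and `labels` columns. For causal LM training, `labels` are simply a copy of `input_ids`, with padding positions set to `-100` so the loss ignores them. A minimal, hypothetical per-example preparation step:

```python
def make_causal_lm_features(token_ids, max_length, pad_id=0):
    """Truncate/pad one tokenized example and build causal-LM labels.

    Labels mirror input_ids, except padding positions are set to -100,
    the value HuggingFace loss functions ignore.
    """
    ids = token_ids[:max_length]
    pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * pad,
        "attention_mask": [1] * len(ids) + [0] * pad,
        "labels": ids + [-100] * pad,
    }
```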
### LoRA Fine-tuning with PEFT
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
model = AutoModelForCausalLM.from_pretrained(
model_name, trust_remote_code=True, device_map="auto"
)
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=16,
lora_alpha=32,
lora_dropout=0.05,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: ~14M || all params: 1.4B || trainable%: ~1.0%
```
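The ~14M figure printed above can be checked by hand: each LoRA adapter contributes two low-rank matrices, `A` (`r × in`) and `B` (`out × r`), i.e. `r * (in + out)` parameters per module. With `r=16` over the six target modules in all 24 layers:

```python
r, hidden, ffn, layers = 16, 2048, 8192, 24

def lora_params(in_f, out_f):
    # A is (r x in_f), B is (out_f x r)
    return r * (in_f + out_f)

per_layer = (
    4 * lora_params(hidden, hidden)  # q_proj, k_proj, v_proj, o_proj
    + lora_params(hidden, ffn)       # up_proj
    + lora_params(ffn, hidden)       # down_proj
)
total = layers * per_layer
print(f"{total:,}")  # 14,155,776 ≈ 14M trainable parameters
```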
### SFT with TRL
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig, TaskType
from datasets import Dataset
model = AutoModelForCausalLM.from_pretrained(
model_name, trust_remote_code=True, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=16,
lora_alpha=32,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
# Dataset with chat messages format
train_data = [
{"messages": [
{"role": "user", "content": "Какая столица России?"},
{"role": "assistant", "content": "Москва - столица Российской Федерации."},
]},
# ... more examples
]
dataset = Dataset.from_list(train_data)
sft_config = SFTConfig(
output_dir="./rugpt3xl-sft",
max_steps=1000,
per_device_train_batch_size=4,
learning_rate=2e-5,
logging_steps=10,
max_length=512,
)
trainer = SFTTrainer(
model=model,
args=sft_config,
train_dataset=dataset,
peft_config=lora_config,
processing_class=tokenizer,
)
trainer.train()
```
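`SFTTrainer` expects each row's `messages` list to contain valid roles with non-empty string content. A small, hypothetical sanity check worth running over a dataset before training:

```python
VALID_ROLES = {"system", "user", "assistant"}

def validate_messages(example):
    """Return True if the example follows the chat-messages schema."""
    msgs = example.get("messages")
    if not msgs:
        return False
    return all(
        m.get("role") in VALID_ROLES
        and isinstance(m.get("content"), str)
        and m["content"]
        for m in msgs
    )

ok = validate_messages({"messages": [
    {"role": "user", "content": "Какая столица России?"},
    {"role": "assistant", "content": "Москва."},
]})
```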
### Supported Fine-tuning Features
| Feature | Status |
|---|---|
| Full parameter training | Supported |
| Gradient checkpointing | Supported |
| LoRA / PEFT | Supported |
| TRL SFTTrainer | Supported |
| DeepSpeed ZeRO | Supported |
| FSDP | Supported |
| KV cache during generation | Supported |
| `labels` argument for loss computation | Supported |
**LoRA target modules:** `q_proj`, `k_proj`, `v_proj`, `o_proj`, `up_proj`, `down_proj`
## Sparse Attention
This model was originally trained with DeepSpeed's
[SparseSelfAttention](https://www.deepspeed.ai/tutorials/sparse-attention/) using the
**alternating** pattern: even layers (0, 2, 4, ...) use block-sparse attention, odd layers
(1, 3, 5, ...) use standard dense causal attention. The sparse layers use a
`FixedSparsityConfig` derived from the "Generating Long Sequences with Sparse Transformers"
paper (Child et al., 2019).
This is a **critical** architectural detail. The model weights were optimized for this specific
attention pattern during training. Running the model with all-dense attention degrades
perplexity from ~12 to ~50.
The converted model **fully replicates** this sparse attention pattern without any DeepSpeed
dependency, using a precomputed block-sparse mask applied to standard dense attention.
| Attention mode | Test PPL (Gazeta) |
|---|---|
| Sparse alternating (original training regime) | **11.68** |
| All-dense (no sparse mask) | ~50.1 |
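The perplexities above are the exponential of the mean per-token negative log-likelihood over the test corpus. Given a summed NLL and token count from any evaluation loop, the computation reduces to (illustrative numbers, not the actual Gazeta evaluation):

```python
import math

def perplexity(total_nll: float, n_tokens: int) -> float:
    """PPL = exp(mean negative log-likelihood per token)."""
    return math.exp(total_nll / n_tokens)

# e.g. a corpus of 1,000 tokens with a summed NLL of 2,460 nats:
print(round(perplexity(2460.0, 1000), 2))  # ≈ 11.7
```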
### Sparse attention parameters
The sparse pattern is controlled by `config.json` fields:
| Parameter | Value | Description |
|---|---|---|
| `sparse_mode` | `"alternating"` | Even layers sparse, odd layers dense |
| `sparse_block_size` | `16` | Token block size for sparse layout |
| `sparse_num_local_blocks` | `8` | Local attention window (8 blocks = 128 tokens) |
| `sparse_num_global_blocks` | `1` | Global blocks per window |
| `sparse_num_different_global_patterns` | `8` | Different heads use different global positions |
Each sparse layer applies a per-head block-sparse mask. Within each window of 128 tokens,
attention is causal (lower-triangular). Across windows, only designated "global" blocks are
visible, with each attention head using a different global block position within the window.
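The layout described above can be sketched as a standalone mask builder. This is an illustrative approximation, not the exact DeepSpeed `FixedSparsityConfig` implementation; the per-head global position is passed in as `head_global_block`:

```python
import numpy as np

def block_sparse_mask(seq_len, block=16, local_blocks=8, head_global_block=0):
    """Build a boolean (seq_len, seq_len) attention mask: True = may attend.

    Within each window of `local_blocks` blocks, attention is causal.
    Across windows, only one designated "global" block per earlier window
    stays visible, at a per-head position.
    """
    n = seq_len // block
    allow = np.zeros((n, n), dtype=bool)
    for q in range(n):
        win_start = (q // local_blocks) * local_blocks
        allow[q, win_start : q + 1] = True           # local causal window
        for w in range(0, win_start, local_blocks):  # earlier windows
            allow[q, w + head_global_block] = True   # this head's global block
    # expand the block layout to tokens, then intersect with the causal mask
    token_mask = np.kron(allow, np.ones((block, block), dtype=bool))
    return token_mask & np.tril(np.ones((seq_len, seq_len), dtype=bool))

mask = block_sparse_mask(512)  # default config: 16-token blocks, 128-token windows
```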
To disable sparse attention (e.g. for experiments), set `sparse_mode` to `"none"` in
`config.json`. This will make all layers use standard dense causal attention.
## Architecture Details
The model implements a custom `RuGPT3XLForCausalLM` class (loaded via `trust_remote_code=True`):
```
RuGPT3XLForCausalLM
├── model (RuGPT3XLModel)
│ ├── embed_tokens (Embedding: 50264 x 2048)
│ ├── embed_positions (Embedding: 2048 x 2048)
│ ├── embed_dropout (Dropout: 0.1)
│ ├── layers (x24) (RuGPT3XLDecoderLayer)
│ │ ├── input_layernorm (LayerNorm: 2048)
│ │ ├── self_attn (RuGPT3XLAttention)
│ │ │ ├── q_proj (Linear: 2048 -> 2048)
│ │ │ ├── k_proj (Linear: 2048 -> 2048)
│ │ │ ├── v_proj (Linear: 2048 -> 2048)
│ │ │ ├── o_proj (Linear: 2048 -> 2048)
│ │ │ ├── attn_dropout (Dropout: 0.1)
│ │ │ └── resid_dropout (Dropout: 0.1)
│ │ ├── post_attention_layernorm (LayerNorm: 2048)
│ │ └── mlp (RuGPT3XLMLP)
│ │ ├── up_proj (Linear: 2048 -> 8192)
│ │ ├── down_proj (Linear: 8192 -> 2048)
│ │ ├── act_fn (GELU)
│ │ └── dropout (Dropout: 0.1)
│ └── norm (LayerNorm: 2048)
└── lm_head (Linear: 2048 -> 50264, no bias)
```
Even-numbered decoder layers (0, 2, 4, ...) apply block-sparse attention masks. Odd-numbered
layers use full causal attention. The sparse layout is precomputed at model initialization from
the config parameters and stored as a non-persistent buffer.
## Conversion
This model was converted from the original Megatron-LM checkpoint using a custom script.
The conversion performs the following transformations:
1. Strips the `module.` prefix from parameter names (FP16 / DDP wrappers)
2. Remaps Megatron-LM naming to HuggingFace convention
3. Splits the fused QKV projection (`[6144, 2048]`) into separate Q, K, V (`[2048, 2048]` each)
4. Saves weights in safetensors format
For full conversion details and the script, see the
[rugpt3xl-convert](https://github.com/EvilFreelancer/rugpt3xl-convert) repository.
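Step 3 can be sketched as follows, assuming the fused weight simply concatenates Q, K, V along the output dimension (the real script may additionally need to de-interleave the per-head layout some Megatron-LM versions use; the weight here is random, for illustration only):

```python
import numpy as np

hidden = 2048
rng = np.random.default_rng(0)
fused_qkv = rng.standard_normal((3 * hidden, hidden), dtype=np.float32)  # [6144, 2048]

# Assumed layout: rows 0..2047 = Q, 2048..4095 = K, 4096..6143 = V
q_proj, k_proj, v_proj = np.split(fused_qkv, 3, axis=0)
assert q_proj.shape == k_proj.shape == v_proj.shape == (hidden, hidden)
```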
## Limitations
- This is a **base model** trained on Russian internet text. It may generate biased, factually
incorrect, or offensive content.
- The model was trained primarily on Russian text. It has limited capability in other languages.
- Maximum context length is 2048 tokens. Inputs longer than this will be truncated.
- The model is not instruction-tuned and works best for text completion rather than
following specific instructions.
## Citation
```bibtex
@misc{rugpt3xl,
title={ruGPT3XL},
author={Pavel Rykov},
year={2026},
publisher={Hugging Face},
url={https://huggingface.co/evilfreelancer/ruGPT3XL}
}
```
## Links
- [A family of pretrained transformer language models for Russian](https://scholar.google.com/citations?view_op=view_citation&hl=en&user=yPayeJIAAAAJ&citation_for_view=yPayeJIAAAAJ:Se3iqnhoufwC) - paper on Google Scholar
- [ai-forever/rugpt3xl](https://huggingface.co/ai-forever/rugpt3xl) - original model
- [ai-forever/ru-gpts](https://github.com/ai-forever/ru-gpts) - original training codebase
- [DeepSpeed Sparse Attention](https://www.deepspeed.ai/tutorials/sparse-attention/) - original sparse attention implementation
- [evilfreelancer/ruGPT3XL-GGUF](https://huggingface.co/evilfreelancer/ruGPT3XL-GGUF) - GGUF conversion of this model