---
language:
- ru
library_name: transformers
tags:
- text-generation
- gpt3
- russian
- causal-lm
license: mit
pipeline_tag: text-generation
base_model: ai-forever/rugpt3xl
---

# ruGPT-3 XL (HuggingFace format)

A 1.3B-parameter GPT-2-style language model for Russian, converted from the original [ai-forever/rugpt3xl](https://huggingface.co/ai-forever/rugpt3xl) Megatron-LM checkpoint into a native HuggingFace `transformers` format.

This is a **base (pretrained) model**, not instruction-tuned. It performs text completion and can be fine-tuned for downstream tasks.

For details, see the paper "[A family of pretrained transformer language models for Russian](https://scholar.google.com/citations?view_op=view_citation&hl=en&user=yPayeJIAAAAJ&citation_for_view=yPayeJIAAAAJ:Se3iqnhoufwC)".

## Model Details

| Parameter | Value |
|---|---|
| Parameters | 1.3B |
| Architecture | GPT-2 (decoder-only transformer) |
| Hidden size | 2048 |
| Layers | 24 |
| Attention heads | 16 |
| FFN intermediate size | 8192 |
| Max sequence length | 2048 |
| Vocabulary | 50,264 tokens (BPE) |
| Activation | GELU |
| Normalization | Pre-LayerNorm |
| Position encoding | Learned absolute |
| Attention | Alternating sparse/dense (see [Sparse Attention](#sparse-attention)) |
| Precision | float16 |
| Training data | 80B tokens of Russian text (4 epochs) |
| Test perplexity | 12.05 |

## Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "evilfreelancer/ruGPT3XL"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto"
)

inputs = tokenizer("Москва - столица", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.2,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Loading Options

**GPU (float16, recommended):**

```python
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto"
)
```

**CPU (float32):**

```python
import torch

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    dtype=torch.float32,
    device_map="cpu"
)
```

## Chat Template

The tokenizer includes a simple chat template for question answering:

```python
messages = [
    {"role": "user", "content": "Какая столица России?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Output: "Вопрос: Какая столица России?\n\nОтвет: "

inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

> **Note:** This is a base model, not an instruction-tuned chatbot. The chat template provides
> a basic structure, but the model may not always follow instructions precisely. For reliable
> conversational behavior, fine-tune the model on instruction/chat data.

## Fine-tuning

The model is fully compatible with standard HuggingFace training workflows.
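The `Trainer` workflow below expects a dataset whose examples contain `input_ids`, `attention_mask`, and `labels`. As a minimal sketch of that format, the snippet below uses a toy whitespace tokenizer as a stand-in for the real BPE tokenizer (`toy_encode` and `PAD_ID` are illustrative assumptions; with the actual model you would call `tokenizer(text, truncation=True, max_length=2048)` and use `tokenizer.pad_token_id`):

```python
# Sketch: preparing Trainer-ready examples for causal-LM fine-tuning.
# `toy_encode` is a hypothetical stand-in for the real BPE tokenizer.

PAD_ID = 0  # illustration only; use tokenizer.pad_token_id in practice

def toy_encode(text, vocab):
    # Assign each previously unseen whitespace token the next free id
    return [vocab.setdefault(w, len(vocab) + 1) for w in text.split()]

def make_example(text, vocab, max_length=8):
    ids = toy_encode(text, vocab)[:max_length]
    pad = max_length - len(ids)
    return {
        "input_ids": ids + [PAD_ID] * pad,
        "attention_mask": [1] * len(ids) + [0] * pad,
        # For causal LM, labels are the input ids themselves (the model
        # shifts them internally); padding positions are set to -100 so
        # the loss ignores them.
        "labels": ids + [-100] * pad,
    }
```

A list of such dicts can be wrapped with `datasets.Dataset.from_list` and passed to `Trainer` as `train_dataset`.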
### Full Fine-tuning with Trainer

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

args = TrainingArguments(
    output_dir="./rugpt3xl-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    fp16=True,
    save_strategy="epoch",
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=your_dataset,  # dataset with input_ids, attention_mask, labels
)
trainer.train()
```

### LoRA Fine-tuning with PEFT

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto"
)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj"],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: ~14M || all params: 1.4B || trainable%: ~1.0%
```

### SFT with TRL

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig, TaskType
from datasets import Dataset

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Dataset in chat-messages format
train_data = [
    {"messages": [
        {"role": "user", "content": "Какая столица России?"},
        {"role": "assistant", "content": "Москва - столица Российской Федерации."},
    ]},
    # ...
    # more examples
]
dataset = Dataset.from_list(train_data)

sft_config = SFTConfig(
    output_dir="./rugpt3xl-sft",
    max_steps=1000,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    logging_steps=10,
    max_length=512,
)

trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=dataset,
    peft_config=lora_config,
    processing_class=tokenizer,
)
trainer.train()
```

### Supported Fine-tuning Features

| Feature | Status |
|---|---|
| Full parameter training | Supported |
| Gradient checkpointing | Supported |
| LoRA / PEFT | Supported |
| TRL SFTTrainer | Supported |
| DeepSpeed ZeRO | Supported |
| FSDP | Supported |
| KV cache during generation | Supported |
| `labels` argument for loss computation | Supported |

**LoRA target modules:** `q_proj`, `k_proj`, `v_proj`, `o_proj`, `up_proj`, `down_proj`

## Sparse Attention

This model was originally trained with DeepSpeed's [SparseSelfAttention](https://www.deepspeed.ai/tutorials/sparse-attention/) using the **alternating** pattern: even layers (0, 2, 4, ...) use block-sparse attention, while odd layers (1, 3, 5, ...) use standard dense causal attention. The sparse layers use a `FixedSparsityConfig` derived from the paper "Generating Long Sequences with Sparse Transformers" (Child et al., 2019).

This is a **critical** architectural detail: the model weights were optimized for this specific attention pattern during training, and running the model with all-dense attention degrades perplexity from ~12 to ~50.

The converted model **fully replicates** this sparse attention pattern without any DeepSpeed dependency, using a precomputed block-sparse mask applied to standard dense attention.
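The masking idea can be sketched in a few lines. The helper below (`apply_sparse_mask` is a hypothetical name, not part of the model code) shows how a precomputed boolean mask turns dense attention into sparse attention; the converted model does the equivalent inside its attention module, in float16 and batched over heads:

```python
import numpy as np

def apply_sparse_mask(scores, allowed):
    """Softmax over attention logits with disallowed positions masked out.

    scores:  [seq, seq] raw query-key logits for one head
    allowed: [seq, seq] boolean mask, True where attention is permitted
    """
    masked = np.where(allowed, scores, -np.inf)           # block forbidden keys
    masked = masked - masked.max(axis=-1, keepdims=True)  # softmax stability
    weights = np.exp(masked)                              # exp(-inf) -> 0
    return weights / weights.sum(axis=-1, keepdims=True)
```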
| Attention mode | Test PPL (Gazeta) |
|---|---|
| Sparse alternating (original training regime) | **11.68** |
| All-dense (no sparse mask) | ~50.1 |

### Sparse attention parameters

The sparse pattern is controlled by `config.json` fields:

| Parameter | Value | Description |
|---|---|---|
| `sparse_mode` | `"alternating"` | Even layers sparse, odd layers dense |
| `sparse_block_size` | `16` | Token block size for sparse layout |
| `sparse_num_local_blocks` | `8` | Local attention window (8 blocks = 128 tokens) |
| `sparse_num_global_blocks` | `1` | Global blocks per window |
| `sparse_num_different_global_patterns` | `8` | Different heads use different global positions |

Each sparse layer applies a per-head block-sparse mask. Within each window of 128 tokens, attention is causal (lower-triangular). Across windows, only designated "global" blocks are visible, with each attention head using a different global block position within the window.

To disable sparse attention (e.g. for experiments), set `sparse_mode` to `"none"` in `config.json`. This will make all layers use standard dense causal attention.
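As a rough illustration of the layout described above, here is a simplified block-level mask builder: causal attention inside the current window of `local_blocks` blocks, plus a per-head global block in every earlier window. This is a sketch under the parameters above, not the model's actual code; DeepSpeed's `FixedSparsityConfig` chooses the global positions somewhat differently.

```python
import numpy as np

def alternating_fixed_layout(num_blocks, head, local_blocks=8,
                             global_blocks=1, num_patterns=8):
    """Block-level attention layout for one sparse layer and one head.

    Simplified sketch of the "fixed" pattern (Child et al., 2019):
    True at [q, k] means query block q may attend to key block k.
    """
    layout = np.zeros((num_blocks, num_blocks), dtype=bool)
    g = head % num_patterns                        # per-head global offset
    for q in range(num_blocks):
        window_start = (q // local_blocks) * local_blocks
        layout[q, window_start:q + 1] = True       # local causal window
        for w in range(0, window_start, local_blocks):
            layout[q, w + g:w + g + global_blocks] = True  # global block(s)
    return layout
```

Expanding each block entry to a `sparse_block_size` x `sparse_block_size` token tile (with a causal triangle on the diagonal tiles) yields the token-level mask that is applied to dense attention scores.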
## Architecture Details

The model implements a custom `RuGPT3XLForCausalLM` class (loaded via `trust_remote_code=True`):

```
RuGPT3XLForCausalLM
├── model (RuGPT3XLModel)
│   ├── embed_tokens (Embedding: 50264 x 2048)
│   ├── embed_positions (Embedding: 2048 x 2048)
│   ├── embed_dropout (Dropout: 0.1)
│   ├── layers (x24) (RuGPT3XLDecoderLayer)
│   │   ├── input_layernorm (LayerNorm: 2048)
│   │   ├── self_attn (RuGPT3XLAttention)
│   │   │   ├── q_proj (Linear: 2048 -> 2048)
│   │   │   ├── k_proj (Linear: 2048 -> 2048)
│   │   │   ├── v_proj (Linear: 2048 -> 2048)
│   │   │   ├── o_proj (Linear: 2048 -> 2048)
│   │   │   ├── attn_dropout (Dropout: 0.1)
│   │   │   └── resid_dropout (Dropout: 0.1)
│   │   ├── post_attention_layernorm (LayerNorm: 2048)
│   │   └── mlp (RuGPT3XLMLP)
│   │       ├── up_proj (Linear: 2048 -> 8192)
│   │       ├── down_proj (Linear: 8192 -> 2048)
│   │       ├── act_fn (GELU)
│   │       └── dropout (Dropout: 0.1)
│   └── norm (LayerNorm: 2048)
└── lm_head (Linear: 2048 -> 50264, no bias)
```

Even-numbered decoder layers (0, 2, 4, ...) apply block-sparse attention masks; odd-numbered layers use full causal attention. The sparse layout is precomputed at model initialization from the config parameters and stored as a non-persistent buffer.

## Conversion

This model was converted from the original Megatron-LM checkpoint using a custom script. The conversion performs the following transformations:

1. Strips the `module.` prefix from parameter names (FP16 / DDP wrappers)
2. Remaps Megatron-LM parameter names to the HuggingFace convention
3. Splits the fused QKV projection (`[6144, 2048]`) into separate Q, K, V matrices (`[2048, 2048]` each)
4. Saves the weights in safetensors format

For full conversion details and the script, see the [rugpt3xl-convert](https://github.com/EvilFreelancer/rugpt3xl-convert) repository.

## Limitations

- This is a **base model** trained on Russian internet text. It may generate biased, factually incorrect, or offensive content.
- The model was trained primarily on Russian text and has limited capability in other languages.
- Maximum context length is 2048 tokens. Inputs longer than this will be truncated.
- The model is not instruction-tuned and works best for text completion rather than following specific instructions.

## Citation

```bibtex
@misc{rugpt3xl,
  title={ruGPT3XL},
  author={Pavel Rykov},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/evilfreelancer/ruGPT3XL}
}
```

## Links

- [A family of pretrained transformer language models for Russian](https://scholar.google.com/citations?view_op=view_citation&hl=en&user=yPayeJIAAAAJ&citation_for_view=yPayeJIAAAAJ:Se3iqnhoufwC) - paper on Google Scholar
- [ai-forever/rugpt3xl](https://huggingface.co/ai-forever/rugpt3xl) - original model
- [ai-forever/ru-gpts](https://github.com/ai-forever/ru-gpts) - original training codebase
- [DeepSpeed Sparse Attention](https://www.deepspeed.ai/tutorials/sparse-attention/) - original sparse attention implementation
- [evilfreelancer/ruGPT3XL-GGUF](https://huggingface.co/evilfreelancer/ruGPT3XL-GGUF) - GGUF conversion of this model