---
language:
- ru
library_name: transformers
tags:
- text-generation
- gpt3
- russian
- causal-lm
license: mit
pipeline_tag: text-generation
base_model: ai-forever/rugpt3xl
---

# ruGPT-3 XL (HuggingFace format)

A 1.3B-parameter GPT-2-style language model for Russian, converted from the original
[ai-forever/rugpt3xl](https://huggingface.co/ai-forever/rugpt3xl) Megatron-LM checkpoint
into a native HuggingFace `transformers` format.

This is a **base (pretrained) model**, not instruction-tuned. It performs text completion
and can be fine-tuned for downstream tasks.

For details, see the paper "[A family of pretrained transformer language models for Russian](https://scholar.google.com/citations?view_op=view_citation&hl=en&user=yPayeJIAAAAJ&citation_for_view=yPayeJIAAAAJ:Se3iqnhoufwC)".

## Model Details

| Parameter | Value |
|---|---|
| Parameters | 1.3B |
| Architecture | GPT-2 (decoder-only transformer) |
| Hidden size | 2048 |
| Layers | 24 |
| Attention heads | 16 |
| FFN intermediate size | 8192 |
| Max sequence length | 2048 |
| Vocabulary | 50,264 tokens (BPE) |
| Activation | GELU |
| Normalization | Pre-LayerNorm |
| Position encoding | Learned absolute |
| Attention | Alternating sparse/dense (see [Sparse Attention](#sparse-attention)) |
| Precision | float16 |
| Training data | 80B tokens of Russian text (4 epochs) |
| Test perplexity | 12.05 |

## Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "evilfreelancer/ruGPT3XL"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, device_map="auto"
)

inputs = tokenizer("Москва - столица", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.2,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Loading Options

**GPU (float16, recommended):**

```python
model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, device_map="auto"
)
```

**CPU (float32):**

```python
import torch

model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, dtype=torch.float32, device_map="cpu"
)
```

## Chat Template

The tokenizer includes a simple chat template for question-answering:

```python
messages = [
    {"role": "user", "content": "Какая столица России?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Output: "Вопрос: Какая столица России?\n\nОтвет: "

inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

> **Note:** This is a base model, not an instruction-tuned chatbot. The chat template provides
> a basic structure, but the model may not always follow instructions precisely. For reliable
> conversational behavior, fine-tune the model on instruction/chat data.

## Fine-tuning

The model is fully compatible with standard HuggingFace training workflows.

### Full Fine-tuning with Trainer

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

args = TrainingArguments(
    output_dir="./rugpt3xl-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    fp16=True,
    save_strategy="epoch",
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=your_dataset,  # dataset with input_ids, attention_mask, labels
)
trainer.train()
```

### LoRA Fine-tuning with PEFT

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, device_map="auto"
)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: ~14M || all params: 1.4B || trainable%: ~1.0%
```

### SFT with TRL

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig, TaskType
from datasets import Dataset

model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Dataset in chat-messages format
train_data = [
    {"messages": [
        {"role": "user", "content": "Какая столица России?"},
        {"role": "assistant", "content": "Москва - столица Российской Федерации."},
    ]},
    # ... more examples
]
dataset = Dataset.from_list(train_data)

sft_config = SFTConfig(
    output_dir="./rugpt3xl-sft",
    max_steps=1000,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    logging_steps=10,
    max_length=512,
)

trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=dataset,
    peft_config=lora_config,
    processing_class=tokenizer,
)
trainer.train()
```

### Supported Fine-tuning Features

| Feature | Status |
|---|---|
| Full parameter training | Supported |
| Gradient checkpointing | Supported |
| LoRA / PEFT | Supported |
| TRL SFTTrainer | Supported |
| DeepSpeed ZeRO | Supported |
| FSDP | Supported |
| KV cache during generation | Supported |
| `labels` argument for loss computation | Supported |

**LoRA target modules:** `q_proj`, `k_proj`, `v_proj`, `o_proj`, `up_proj`, `down_proj`

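
The `labels` argument follows the standard HuggingFace causal-LM convention: pass `input_ids` as `labels` and the model returns shifted next-token cross-entropy in `outputs.loss`. A minimal sketch of that convention, using a tiny randomly initialized GPT-2 as a stand-in so it runs offline (the real model accepts `labels` the same way):

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Tiny stand-in model; RuGPT3XLForCausalLM accepts `labels` identically,
# but would require downloading the 1.3B checkpoint.
config = GPT2Config(n_layer=2, n_head=2, n_embd=32, vocab_size=100)
model = GPT2LMHeadModel(config)

input_ids = torch.randint(0, 100, (1, 16))
# The model shifts labels internally and computes next-token cross-entropy.
outputs = model(input_ids=input_ids, labels=input_ids)
print(outputs.loss)  # scalar loss tensor, usable with loss.backward()
```
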
## Sparse Attention

This model was originally trained with DeepSpeed's
[SparseSelfAttention](https://www.deepspeed.ai/tutorials/sparse-attention/) using the
**alternating** pattern: even layers (0, 2, 4, ...) use block-sparse attention, odd layers
(1, 3, 5, ...) use standard dense causal attention. The sparse layers use a
`FixedSparsityConfig` derived from the "Generating Long Sequences with Sparse Transformers"
paper (Child et al., 2019).

This is a **critical** architectural detail. The model weights were optimized for this specific
attention pattern during training. Running the model with all-dense attention degrades
perplexity from ~12 to ~50.

The converted model **fully replicates** this sparse attention pattern without any DeepSpeed
dependency, using a precomputed block-sparse mask applied to standard dense attention.

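
The masking mechanism can be sketched generically. This is the standard pattern for emulating block-sparse attention on dense hardware (not this model's exact code): disallowed positions are filled with `-inf` before the softmax, so they receive zero attention weight.

```python
import torch

# Generic pattern: gate dense attention scores with a precomputed boolean
# visibility mask. A causal mask stands in for the block-sparse layout.
scores = torch.randn(1, 1, 4, 4)                       # [batch, heads, query, key]
mask = torch.tril(torch.ones(4, 4, dtype=torch.bool))  # stand-in visibility mask
scores = scores.masked_fill(~mask, float("-inf"))      # hide masked positions
probs = scores.softmax(dim=-1)                         # zeros where masked
print(probs[0, 0, 0])  # first query position attends only to itself
```
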
| Attention mode | Test PPL (Gazeta) |
|---|---|
| Sparse alternating (original training regime) | **11.68** |
| All-dense (no sparse mask) | ~50.1 |

### Sparse attention parameters

The sparse pattern is controlled by `config.json` fields:

| Parameter | Value | Description |
|---|---|---|
| `sparse_mode` | `"alternating"` | Even layers sparse, odd layers dense |
| `sparse_block_size` | `16` | Token block size for sparse layout |
| `sparse_num_local_blocks` | `8` | Local attention window (8 blocks = 128 tokens) |
| `sparse_num_global_blocks` | `1` | Global blocks per window |
| `sparse_num_different_global_patterns` | `8` | Different heads use different global positions |

Each sparse layer applies a per-head block-sparse mask. Within each window of 128 tokens,
attention is causal (lower-triangular). Across windows, only designated "global" blocks are
visible, with each attention head using a different global block position within the window.

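
To illustrate how these parameters combine, the following sketch builds a per-head block-sparse causal mask from the documented values. The helper `build_fixed_sparse_mask` is hypothetical, written from the description above, not taken from the model's actual implementation:

```python
import torch

def build_fixed_sparse_mask(seq_len=256, block=16, local_blocks=8,
                            num_heads=16, num_global_patterns=8):
    """Illustrative fixed block-sparse causal mask (assumed semantics)."""
    n_blocks = seq_len // block
    window = local_blocks  # window measured in blocks: 8 * 16 = 128 tokens
    layout = torch.zeros(num_heads, n_blocks, n_blocks, dtype=torch.bool)
    for h in range(num_heads):
        # heads cycle through the available global block slots in a window
        global_slot = h % num_global_patterns
        for q in range(n_blocks):
            win_start = (q // window) * window
            # local: blocks from the window start through the query block
            layout[h, q, win_start:q + 1] = True
            # global: one designated block in each earlier window
            for w_start in range(0, win_start, window):
                layout[h, q, w_start + global_slot] = True
    # expand block layout to token level, then enforce token-level causality
    mask = layout.repeat_interleave(block, dim=1).repeat_interleave(block, dim=2)
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    return mask & causal

mask = build_fixed_sparse_mask()
print(mask.shape)  # torch.Size([16, 256, 256])
```
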
To disable sparse attention (e.g. for experiments), set `sparse_mode` to `"none"` in
`config.json`. This will make all layers use standard dense causal attention.

## Architecture Details

The model implements a custom `RuGPT3XLForCausalLM` class (loaded via `trust_remote_code=True`):

```
RuGPT3XLForCausalLM
├── model (RuGPT3XLModel)
│   ├── embed_tokens (Embedding: 50264 x 2048)
│   ├── embed_positions (Embedding: 2048 x 2048)
│   ├── embed_dropout (Dropout: 0.1)
│   ├── layers (x24) (RuGPT3XLDecoderLayer)
│   │   ├── input_layernorm (LayerNorm: 2048)
│   │   ├── self_attn (RuGPT3XLAttention)
│   │   │   ├── q_proj (Linear: 2048 -> 2048)
│   │   │   ├── k_proj (Linear: 2048 -> 2048)
│   │   │   ├── v_proj (Linear: 2048 -> 2048)
│   │   │   ├── o_proj (Linear: 2048 -> 2048)
│   │   │   ├── attn_dropout (Dropout: 0.1)
│   │   │   └── resid_dropout (Dropout: 0.1)
│   │   ├── post_attention_layernorm (LayerNorm: 2048)
│   │   └── mlp (RuGPT3XLMLP)
│   │       ├── up_proj (Linear: 2048 -> 8192)
│   │       ├── down_proj (Linear: 8192 -> 2048)
│   │       ├── act_fn (GELU)
│   │       └── dropout (Dropout: 0.1)
│   └── norm (LayerNorm: 2048)
└── lm_head (Linear: 2048 -> 50264, no bias)
```

Even-numbered decoder layers (0, 2, 4, ...) apply block-sparse attention masks. Odd-numbered
layers use full causal attention. The sparse layout is precomputed at model initialization from
the config parameters and stored as a non-persistent buffer.

## Conversion

This model was converted from the original Megatron-LM checkpoint using a custom script.
The conversion performs the following transformations:

1. Strips the `module.` prefix from parameter names (FP16 / DDP wrappers)
2. Remaps Megatron-LM naming to the HuggingFace convention
3. Splits the fused QKV projection (`[6144, 2048]`) into separate Q, K, V (`[2048, 2048]` each)
4. Saves weights in safetensors format

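
Step 3 can be sketched as follows. This shows the shapes only; depending on the Megatron-LM version, the fused tensor may interleave Q, K, and V per attention head, which the actual conversion script would have to account for:

```python
import torch

# Stand-in for the fused Megatron QKV weight of shape [3 * hidden, hidden].
hidden = 2048
fused_qkv = torch.arange(3 * hidden * hidden, dtype=torch.float32).reshape(3 * hidden, hidden)

# Naive contiguous split into Q, K, V weights of [hidden, hidden] each.
q_w, k_w, v_w = torch.chunk(fused_qkv, 3, dim=0)
print(q_w.shape, k_w.shape, v_w.shape)
```
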
For full conversion details and the script, see the
[rugpt3xl-convert](https://github.com/EvilFreelancer/rugpt3xl-convert) repository.

## Limitations

- This is a **base model** trained on Russian internet text. It may generate biased, factually
  incorrect, or offensive content.
- The model was trained primarily on Russian text. It has limited capability in other languages.
- Maximum context length is 2048 tokens; longer inputs must be truncated before inference.
- The model is not instruction-tuned and works best for text completion rather than
  following specific instructions.

## Citation

```bibtex
@misc{rugpt3xl,
  title={ruGPT3XL},
  author={Pavel Rykov},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/evilfreelancer/ruGPT3XL}
}
```

## Links

- [A family of pretrained transformer language models for Russian](https://scholar.google.com/citations?view_op=view_citation&hl=en&user=yPayeJIAAAAJ&citation_for_view=yPayeJIAAAAJ:Se3iqnhoufwC) - the paper on Google Scholar
- [ai-forever/rugpt3xl](https://huggingface.co/ai-forever/rugpt3xl) - original model
- [ai-forever/ru-gpts](https://github.com/ai-forever/ru-gpts) - original training codebase
- [DeepSpeed Sparse Attention](https://www.deepspeed.ai/tutorials/sparse-attention/) - original sparse attention implementation
- [evilfreelancer/ruGPT3XL-GGUF](https://huggingface.co/evilfreelancer/ruGPT3XL-GGUF) - GGUF conversion of this model