---
language:
- ru
library_name: transformers
tags:
- text-generation
- gpt3
- russian
- causal-lm
license: mit
pipeline_tag: text-generation
base_model: ai-forever/rugpt3xl
---

# ruGPT-3 XL (HuggingFace format)

A 1.3B-parameter GPT-2-style language model for Russian, converted from the original [ai-forever/rugpt3xl](https://huggingface.co/ai-forever/rugpt3xl) Megatron-LM checkpoint into a native HuggingFace `transformers` format.

This is a **base (pretrained) model**, not instruction-tuned. It performs text completion and can be fine-tuned for downstream tasks.

For details, see the paper "[A family of pretrained transformer language models for Russian](https://scholar.google.com/citations?view_op=view_citation&hl=en&user=yPayeJIAAAAJ&citation_for_view=yPayeJIAAAAJ:Se3iqnhoufwC)".

## Model Details

| Parameter | Value |
|---|---|
| Parameters | 1.3B |
| Architecture | GPT-2 (decoder-only transformer) |
| Hidden size | 2048 |
| Layers | 24 |
| Attention heads | 16 |
| FFN intermediate size | 8192 |
| Max sequence length | 2048 |
| Vocabulary | 50,264 tokens (BPE) |
| Activation | GELU |
| Normalization | Pre-LayerNorm |
| Position encoding | Learned absolute |
| Attention | Alternating sparse/dense (see [Sparse Attention](#sparse-attention)) |
| Precision | float16 |
| Training data | 80B tokens of Russian text (4 epochs) |
| Test perplexity | 12.05 |

## Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "evilfreelancer/ruGPT3XL"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto"
)

inputs = tokenizer("Москва - столица", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.2,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Loading Options

**GPU (float16, recommended):**

```python
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto"
)
```

**CPU (float32):**

```python
import torch

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    dtype=torch.float32,
    device_map="cpu"
)
```

## Chat Template

The tokenizer includes a simple chat template for question answering:

```python
messages = [
    {"role": "user", "content": "Какая столица России?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Output: "Вопрос: Какая столица России?\n\nОтвет: "

inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

> **Note:** This is a base model, not an instruction-tuned chatbot. The chat template provides
> a basic structure, but the model may not always follow instructions precisely. For reliable
> conversational behavior, fine-tune the model on instruction/chat data.

## Fine-tuning

The model is fully compatible with standard HuggingFace training workflows.
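The `Trainer` workflow below expects a dataset whose examples contain `input_ids`, `attention_mask`, and `labels`. As a minimal sketch of that format, the snippet below uses a toy whitespace tokenizer as a stand-in for the real BPE tokenizer (`toy_encode` and `PAD_ID` are illustrative assumptions; with the actual model you would call `tokenizer(text, truncation=True, max_length=2048)` and use `tokenizer.pad_token_id`):

```python
# Sketch: preparing Trainer-ready examples for causal-LM fine-tuning.
# `toy_encode` is a hypothetical stand-in for the real BPE tokenizer.

PAD_ID = 0  # illustration only; use tokenizer.pad_token_id in practice

def toy_encode(text, vocab):
    # Assign each previously unseen whitespace token the next free id
    return [vocab.setdefault(w, len(vocab) + 1) for w in text.split()]

def make_example(text, vocab, max_length=8):
    ids = toy_encode(text, vocab)[:max_length]
    pad = max_length - len(ids)
    return {
        "input_ids": ids + [PAD_ID] * pad,
        "attention_mask": [1] * len(ids) + [0] * pad,
        # For causal LM, labels are the input ids themselves (the model
        # shifts them internally); padding positions are set to -100 so
        # the loss ignores them.
        "labels": ids + [-100] * pad,
    }
```

A list of such dicts can be wrapped with `datasets.Dataset.from_list` and passed to `Trainer` as `train_dataset`.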
### Full Fine-tuning with Trainer

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

args = TrainingArguments(
    output_dir="./rugpt3xl-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    fp16=True,
    save_strategy="epoch",
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=your_dataset,  # dataset with input_ids, attention_mask, labels
)
trainer.train()
```

### LoRA Fine-tuning with PEFT

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto"
)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj"],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: ~14M || all params: 1.4B || trainable%: ~1.0%
```

### SFT with TRL

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig, TaskType
from datasets import Dataset

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Dataset in chat-messages format
train_data = [
    {"messages": [
        {"role": "user", "content": "Какая столица России?"},
        {"role": "assistant", "content": "Москва - столица Российской Федерации."},
    ]},
    # ...
    # more examples
]
dataset = Dataset.from_list(train_data)

sft_config = SFTConfig(
    output_dir="./rugpt3xl-sft",
    max_steps=1000,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    logging_steps=10,
    max_length=512,
)

trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=dataset,
    peft_config=lora_config,
    processing_class=tokenizer,
)
trainer.train()
```

### Supported Fine-tuning Features

| Feature | Status |
|---|---|
| Full parameter training | Supported |
| Gradient checkpointing | Supported |
| LoRA / PEFT | Supported |
| TRL SFTTrainer | Supported |
| DeepSpeed ZeRO | Supported |
| FSDP | Supported |
| KV cache during generation | Supported |
| `labels` argument for loss computation | Supported |

**LoRA target modules:** `q_proj`, `k_proj`, `v_proj`, `o_proj`, `up_proj`, `down_proj`

## Sparse Attention

This model was originally trained with DeepSpeed's [SparseSelfAttention](https://www.deepspeed.ai/tutorials/sparse-attention/) using the **alternating** pattern: even layers (0, 2, 4, ...) use block-sparse attention, while odd layers (1, 3, 5, ...) use standard dense causal attention. The sparse layers use a `FixedSparsityConfig` derived from the paper "Generating Long Sequences with Sparse Transformers" (Child et al., 2019).

This is a **critical** architectural detail: the model weights were optimized for this specific attention pattern during training, and running the model with all-dense attention degrades perplexity from ~12 to ~50.

The converted model **fully replicates** this sparse attention pattern without any DeepSpeed dependency, using a precomputed block-sparse mask applied to standard dense attention.
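The masking idea can be sketched in a few lines. The helper below (`apply_sparse_mask` is a hypothetical name, not part of the model code) shows how a precomputed boolean mask turns dense attention into sparse attention; the converted model does the equivalent inside its attention module, in float16 and batched over heads:

```python
import numpy as np

def apply_sparse_mask(scores, allowed):
    """Softmax over attention logits with disallowed positions masked out.

    scores:  [seq, seq] raw query-key logits for one head
    allowed: [seq, seq] boolean mask, True where attention is permitted
    """
    masked = np.where(allowed, scores, -np.inf)           # block forbidden keys
    masked = masked - masked.max(axis=-1, keepdims=True)  # softmax stability
    weights = np.exp(masked)                              # exp(-inf) -> 0
    return weights / weights.sum(axis=-1, keepdims=True)
```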
| Attention mode | Test PPL (Gazeta) |
|---|---|
| Sparse alternating (original training regime) | **11.68** |
| All-dense (no sparse mask) | ~50.1 |

### Sparse attention parameters

The sparse pattern is controlled by `config.json` fields:

| Parameter | Value | Description |
|---|---|---|
| `sparse_mode` | `"alternating"` | Even layers sparse, odd layers dense |
| `sparse_block_size` | `16` | Token block size for sparse layout |
| `sparse_num_local_blocks` | `8` | Local attention window (8 blocks = 128 tokens) |
| `sparse_num_global_blocks` | `1` | Global blocks per window |
| `sparse_num_different_global_patterns` | `8` | Different heads use different global positions |

Each sparse layer applies a per-head block-sparse mask. Within each window of 128 tokens, attention is causal (lower-triangular). Across windows, only designated "global" blocks are visible, with each attention head using a different global block position within the window.

To disable sparse attention (e.g. for experiments), set `sparse_mode` to `"none"` in `config.json`. This will make all layers use standard dense causal attention.
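As a rough illustration of the layout described above, here is a simplified block-level mask builder: causal attention inside the current window of `local_blocks` blocks, plus a per-head global block in every earlier window. This is a sketch under the parameters above, not the model's actual code; DeepSpeed's `FixedSparsityConfig` chooses the global positions somewhat differently.

```python
import numpy as np

def alternating_fixed_layout(num_blocks, head, local_blocks=8,
                             global_blocks=1, num_patterns=8):
    """Block-level attention layout for one sparse layer and one head.

    Simplified sketch of the "fixed" pattern (Child et al., 2019):
    True at [q, k] means query block q may attend to key block k.
    """
    layout = np.zeros((num_blocks, num_blocks), dtype=bool)
    g = head % num_patterns                        # per-head global offset
    for q in range(num_blocks):
        window_start = (q // local_blocks) * local_blocks
        layout[q, window_start:q + 1] = True       # local causal window
        for w in range(0, window_start, local_blocks):
            layout[q, w + g:w + g + global_blocks] = True  # global block(s)
    return layout
```

Expanding each block entry to a `sparse_block_size` x `sparse_block_size` token tile (with a causal triangle on the diagonal tiles) yields the token-level mask that is applied to dense attention scores.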
## Architecture Details

The model implements a custom `RuGPT3XLForCausalLM` class (loaded via `trust_remote_code=True`):

```
RuGPT3XLForCausalLM
├── model (RuGPT3XLModel)
│   ├── embed_tokens (Embedding: 50264 x 2048)
│   ├── embed_positions (Embedding: 2048 x 2048)
│   ├── embed_dropout (Dropout: 0.1)
│   ├── layers (x24) (RuGPT3XLDecoderLayer)
│   │   ├── input_layernorm (LayerNorm: 2048)
│   │   ├── self_attn (RuGPT3XLAttention)
│   │   │   ├── q_proj (Linear: 2048 -> 2048)
│   │   │   ├── k_proj (Linear: 2048 -> 2048)
│   │   │   ├── v_proj (Linear: 2048 -> 2048)
│   │   │   ├── o_proj (Linear: 2048 -> 2048)
│   │   │   ├── attn_dropout (Dropout: 0.1)
│   │   │   └── resid_dropout (Dropout: 0.1)
│   │   ├── post_attention_layernorm (LayerNorm: 2048)
│   │   └── mlp (RuGPT3XLMLP)
│   │       ├── up_proj (Linear: 2048 -> 8192)
│   │       ├── down_proj (Linear: 8192 -> 2048)
│   │       ├── act_fn (GELU)
│   │       └── dropout (Dropout: 0.1)
│   └── norm (LayerNorm: 2048)
└── lm_head (Linear: 2048 -> 50264, no bias)
```

Even-numbered decoder layers (0, 2, 4, ...) apply block-sparse attention masks; odd-numbered layers use full causal attention. The sparse layout is precomputed at model initialization from the config parameters and stored as a non-persistent buffer.

## Conversion

This model was converted from the original Megatron-LM checkpoint using a custom script. The conversion performs the following transformations:

1. Strips the `module.` prefix from parameter names (FP16 / DDP wrappers)
2. Remaps Megatron-LM parameter names to the HuggingFace convention
3. Splits the fused QKV projection (`[6144, 2048]`) into separate Q, K, V matrices (`[2048, 2048]` each)
4. Saves the weights in safetensors format

For full conversion details and the script, see the [rugpt3xl-convert](https://github.com/EvilFreelancer/rugpt3xl-convert) repository.

## Limitations

- This is a **base model** trained on Russian internet text. It may generate biased, factually incorrect, or offensive content.
- The model was trained primarily on Russian text and has limited capability in other languages.
- Maximum context length is 2048 tokens. Inputs longer than this will be truncated.
- The model is not instruction-tuned and works best for text completion rather than following specific instructions.

## Citation

```bibtex
@misc{rugpt3xl,
  title={ruGPT3XL},
  author={Pavel Rykov},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/evilfreelancer/ruGPT3XL}
}
```

## Links

- [A family of pretrained transformer language models for Russian](https://scholar.google.com/citations?view_op=view_citation&hl=en&user=yPayeJIAAAAJ&citation_for_view=yPayeJIAAAAJ:Se3iqnhoufwC) - paper on Google Scholar
- [ai-forever/rugpt3xl](https://huggingface.co/ai-forever/rugpt3xl) - original model
- [ai-forever/ru-gpts](https://github.com/ai-forever/ru-gpts) - original training codebase
- [DeepSpeed Sparse Attention](https://www.deepspeed.ai/tutorials/sparse-attention/) - original sparse attention implementation
- [evilfreelancer/ruGPT3XL-GGUF](https://huggingface.co/evilfreelancer/ruGPT3XL-GGUF) - GGUF conversion of this model