---
language:
- ru
library_name: transformers
tags:
- text-generation
- gpt3
- russian
- causal-lm
license: mit
pipeline_tag: text-generation
base_model: ai-forever/rugpt3xl
---

# ruGPT-3 XL (HuggingFace format)

A 1.3B-parameter GPT-2-style language model for Russian, converted from the original
[ai-forever/rugpt3xl](https://huggingface.co/ai-forever/rugpt3xl) Megatron-LM checkpoint
into a native HuggingFace `transformers` format.

This is a **base (pretrained) model**, not instruction-tuned. It performs text completion
and can be fine-tuned for downstream tasks.

For details, see the paper "[A family of pretrained transformer language models for Russian](https://scholar.google.com/citations?view_op=view_citation&hl=en&user=yPayeJIAAAAJ&citation_for_view=yPayeJIAAAAJ:Se3iqnhoufwC)".

## Model Details

| Parameter | Value |
|---|---|
| Parameters | 1.3B |
| Architecture | GPT-2 (decoder-only transformer) |
| Hidden size | 2048 |
| Layers | 24 |
| Attention heads | 16 |
| FFN intermediate size | 8192 |
| Max sequence length | 2048 |
| Vocabulary | 50,264 tokens (BPE) |
| Activation | GELU |
| Normalization | Pre-LayerNorm |
| Position encoding | Learned absolute |
| Attention | Alternating sparse/dense (see [Sparse Attention](#sparse-attention)) |
| Precision | float16 |
| Training data | 80B tokens of Russian text (4 epochs) |
| Test perplexity | 12.05 |

## Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "evilfreelancer/ruGPT3XL"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, device_map="auto"
)

inputs = tokenizer("Москва - столица", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.2,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Loading Options

**GPU (float16, recommended):**

```python
model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, device_map="auto"
)
```

**CPU (float32):**

```python
import torch

model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, dtype=torch.float32, device_map="cpu"
)
```

## Chat Template

The tokenizer includes a simple chat template for question-answering:

```python
messages = [
    {"role": "user", "content": "Какая столица России?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Output: "Вопрос: Какая столица России?\n\nОтвет: "

inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

> **Note:** This is a base model, not an instruction-tuned chatbot. The chat template provides
> a basic structure, but the model may not always follow instructions precisely. For reliable
> conversational behavior, fine-tune the model on instruction/chat data.

## Fine-tuning

The model is fully compatible with standard HuggingFace training workflows.

### Full Fine-tuning with Trainer

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

args = TrainingArguments(
    output_dir="./rugpt3xl-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    fp16=True,
    save_strategy="epoch",
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=your_dataset,  # dataset with input_ids, attention_mask, labels
)
trainer.train()
```

### LoRA Fine-tuning with PEFT

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, device_map="auto"
)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: ~14M || all params: 1.4B || trainable%: ~1.0%
```

### SFT with TRL

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig, TaskType
from datasets import Dataset

model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Dataset in chat-messages format
train_data = [
    {"messages": [
        {"role": "user", "content": "Какая столица России?"},
        {"role": "assistant", "content": "Москва - столица Российской Федерации."},
    ]},
    # ... more examples
]
dataset = Dataset.from_list(train_data)

sft_config = SFTConfig(
    output_dir="./rugpt3xl-sft",
    max_steps=1000,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    logging_steps=10,
    max_length=512,
)

trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=dataset,
    peft_config=lora_config,
    processing_class=tokenizer,
)
trainer.train()
```

### Supported Fine-tuning Features

| Feature | Status |
|---|---|
| Full parameter training | Supported |
| Gradient checkpointing | Supported |
| LoRA / PEFT | Supported |
| TRL SFTTrainer | Supported |
| DeepSpeed ZeRO | Supported |
| FSDP | Supported |
| KV cache during generation | Supported |
| `labels` argument for loss computation | Supported |

**LoRA target modules:** `q_proj`, `k_proj`, `v_proj`, `o_proj`, `up_proj`, `down_proj`

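
The `labels` argument follows the standard HuggingFace causal-LM convention: pass `input_ids` as `labels` and the model returns shifted next-token cross-entropy in `outputs.loss`. A minimal sketch of that convention, using a tiny randomly initialized GPT-2 as a stand-in so it runs offline (the real model accepts `labels` the same way):

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Tiny stand-in model; RuGPT3XLForCausalLM accepts `labels` identically,
# but would require downloading the 1.3B checkpoint.
config = GPT2Config(n_layer=2, n_head=2, n_embd=32, vocab_size=100)
model = GPT2LMHeadModel(config)

input_ids = torch.randint(0, 100, (1, 16))
# The model shifts labels internally and computes next-token cross-entropy.
outputs = model(input_ids=input_ids, labels=input_ids)
print(outputs.loss)  # scalar loss tensor, usable with loss.backward()
```
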
## Sparse Attention

This model was originally trained with DeepSpeed's
[SparseSelfAttention](https://www.deepspeed.ai/tutorials/sparse-attention/) using the
**alternating** pattern: even layers (0, 2, 4, ...) use block-sparse attention, odd layers
(1, 3, 5, ...) use standard dense causal attention. The sparse layers use a
`FixedSparsityConfig` derived from the "Generating Long Sequences with Sparse Transformers"
paper (Child et al., 2019).

This is a **critical** architectural detail. The model weights were optimized for this specific
attention pattern during training. Running the model with all-dense attention degrades
perplexity from ~12 to ~50.

The converted model **fully replicates** this sparse attention pattern without any DeepSpeed
dependency, using a precomputed block-sparse mask applied to standard dense attention.

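
The masking mechanism can be sketched generically. This is the standard pattern for emulating block-sparse attention on dense hardware (not this model's exact code): disallowed positions are filled with `-inf` before the softmax, so they receive zero attention weight.

```python
import torch

# Generic pattern: gate dense attention scores with a precomputed boolean
# visibility mask. A causal mask stands in for the block-sparse layout.
scores = torch.randn(1, 1, 4, 4)                       # [batch, heads, query, key]
mask = torch.tril(torch.ones(4, 4, dtype=torch.bool))  # stand-in visibility mask
scores = scores.masked_fill(~mask, float("-inf"))      # hide masked positions
probs = scores.softmax(dim=-1)                         # zeros where masked
print(probs[0, 0, 0])  # first query position attends only to itself
```
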
| Attention mode | Test PPL (Gazeta) |
|---|---|
| Sparse alternating (original training regime) | **11.68** |
| All-dense (no sparse mask) | ~50.1 |

### Sparse attention parameters

The sparse pattern is controlled by `config.json` fields:

| Parameter | Value | Description |
|---|---|---|
| `sparse_mode` | `"alternating"` | Even layers sparse, odd layers dense |
| `sparse_block_size` | `16` | Token block size for sparse layout |
| `sparse_num_local_blocks` | `8` | Local attention window (8 blocks = 128 tokens) |
| `sparse_num_global_blocks` | `1` | Global blocks per window |
| `sparse_num_different_global_patterns` | `8` | Different heads use different global positions |

Each sparse layer applies a per-head block-sparse mask. Within each window of 128 tokens,
attention is causal (lower-triangular). Across windows, only designated "global" blocks are
visible, with each attention head using a different global block position within the window.

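
To illustrate how these parameters combine, the following sketch builds a per-head block-sparse causal mask from the documented values. The helper `build_fixed_sparse_mask` is hypothetical, written from the description above, not taken from the model's actual implementation:

```python
import torch

def build_fixed_sparse_mask(seq_len=256, block=16, local_blocks=8,
                            num_heads=16, num_global_patterns=8):
    """Illustrative fixed block-sparse causal mask (assumed semantics)."""
    n_blocks = seq_len // block
    window = local_blocks  # window measured in blocks: 8 * 16 = 128 tokens
    layout = torch.zeros(num_heads, n_blocks, n_blocks, dtype=torch.bool)
    for h in range(num_heads):
        # heads cycle through the available global block slots in a window
        global_slot = h % num_global_patterns
        for q in range(n_blocks):
            win_start = (q // window) * window
            # local: blocks from the window start through the query block
            layout[h, q, win_start:q + 1] = True
            # global: one designated block in each earlier window
            for w_start in range(0, win_start, window):
                layout[h, q, w_start + global_slot] = True
    # expand block layout to token level, then enforce token-level causality
    mask = layout.repeat_interleave(block, dim=1).repeat_interleave(block, dim=2)
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    return mask & causal

mask = build_fixed_sparse_mask()
print(mask.shape)  # torch.Size([16, 256, 256])
```
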
To disable sparse attention (e.g. for experiments), set `sparse_mode` to `"none"` in
`config.json`. This will make all layers use standard dense causal attention.

## Architecture Details

The model implements a custom `RuGPT3XLForCausalLM` class (loaded via `trust_remote_code=True`):

```
RuGPT3XLForCausalLM
├── model (RuGPT3XLModel)
│   ├── embed_tokens (Embedding: 50264 x 2048)
│   ├── embed_positions (Embedding: 2048 x 2048)
│   ├── embed_dropout (Dropout: 0.1)
│   ├── layers (x24) (RuGPT3XLDecoderLayer)
│   │   ├── input_layernorm (LayerNorm: 2048)
│   │   ├── self_attn (RuGPT3XLAttention)
│   │   │   ├── q_proj (Linear: 2048 -> 2048)
│   │   │   ├── k_proj (Linear: 2048 -> 2048)
│   │   │   ├── v_proj (Linear: 2048 -> 2048)
│   │   │   ├── o_proj (Linear: 2048 -> 2048)
│   │   │   ├── attn_dropout (Dropout: 0.1)
│   │   │   └── resid_dropout (Dropout: 0.1)
│   │   ├── post_attention_layernorm (LayerNorm: 2048)
│   │   └── mlp (RuGPT3XLMLP)
│   │       ├── up_proj (Linear: 2048 -> 8192)
│   │       ├── down_proj (Linear: 8192 -> 2048)
│   │       ├── act_fn (GELU)
│   │       └── dropout (Dropout: 0.1)
│   └── norm (LayerNorm: 2048)
└── lm_head (Linear: 2048 -> 50264, no bias)
```

Even-numbered decoder layers (0, 2, 4, ...) apply block-sparse attention masks. Odd-numbered
layers use full causal attention. The sparse layout is precomputed at model initialization from
the config parameters and stored as a non-persistent buffer.

## Conversion

This model was converted from the original Megatron-LM checkpoint using a custom script.
The conversion performs the following transformations:

1. Strips the `module.` prefix from parameter names (FP16 / DDP wrappers)
2. Remaps Megatron-LM naming to the HuggingFace convention
3. Splits the fused QKV projection (`[6144, 2048]`) into separate Q, K, V (`[2048, 2048]` each)
4. Saves weights in safetensors format

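
Step 3 can be sketched as follows. This shows the shapes only; depending on the Megatron-LM version, the fused tensor may interleave Q, K, and V per attention head, which the actual conversion script would have to account for:

```python
import torch

# Stand-in for the fused Megatron QKV weight of shape [3 * hidden, hidden].
hidden = 2048
fused_qkv = torch.arange(3 * hidden * hidden, dtype=torch.float32).reshape(3 * hidden, hidden)

# Naive contiguous split into Q, K, V weights of [hidden, hidden] each.
q_w, k_w, v_w = torch.chunk(fused_qkv, 3, dim=0)
print(q_w.shape, k_w.shape, v_w.shape)
```
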
For full conversion details and the script, see the
[rugpt3xl-convert](https://github.com/EvilFreelancer/rugpt3xl-convert) repository.

## Limitations

- This is a **base model** trained on Russian internet text. It may generate biased, factually
  incorrect, or offensive content.
- The model was trained primarily on Russian text. It has limited capability in other languages.
- Maximum context length is 2048 tokens; longer inputs must be truncated before inference.
- The model is not instruction-tuned and works best for text completion rather than
  following specific instructions.

## Citation

```bibtex
@misc{rugpt3xl,
  title={ruGPT3XL},
  author={Pavel Rykov},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/evilfreelancer/ruGPT3XL}
}
```

## Links

- [A family of pretrained transformer language models for Russian](https://scholar.google.com/citations?view_op=view_citation&hl=en&user=yPayeJIAAAAJ&citation_for_view=yPayeJIAAAAJ:Se3iqnhoufwC) - the paper on Google Scholar
- [ai-forever/rugpt3xl](https://huggingface.co/ai-forever/rugpt3xl) - original model
- [ai-forever/ru-gpts](https://github.com/ai-forever/ru-gpts) - original training codebase
- [DeepSpeed Sparse Attention](https://www.deepspeed.ai/tutorials/sparse-attention/) - original sparse attention implementation
- [evilfreelancer/ruGPT3XL-GGUF](https://huggingface.co/evilfreelancer/ruGPT3XL-GGUF) - GGUF conversion of this model