# ruGPT-3 XL 8k
A 1.3B-parameter GPT-2-style language model for Russian with an extended context window of 8192 tokens, trained via continued pretraining from evilfreelancer/ruGPT3XL.
This is a base (pretrained) model, not instruction-tuned.
## Model Details
| Parameter | Value |
|---|---|
| Parameters | 1.3B |
| Architecture | GPT-2 (decoder-only transformer) |
| Hidden size | 2048 |
| Layers | 24 |
| Attention heads | 16 |
| FFN intermediate size | 8192 |
| Max sequence length | 8192 |
| Vocabulary | 50,264 tokens (BPE) |
| Activation | GELU |
| Normalization | Pre-LayerNorm |
| Position encoding | Learned absolute (tiled extension) |
| Attention | Alternating sparse/dense |
| Precision | bfloat16 |
| Base model | evilfreelancer/ruGPT3XL (2048 ctx) |
| Fine-tuning dataset | IlyaGusev/gazeta |
## Quick Start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "evilfreelancer/ruGPT3XL-8k"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype="bfloat16",
    device_map="auto",
)

inputs = tokenizer("Москва - столица", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.2,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Context Extension: 2k -> 4k -> 8k
The original ruGPT3XL uses learned absolute positional embeddings (APE): the position
table `embed_positions` is a plain `nn.Embedding(max_position_embeddings, hidden_size)`
trained together with all other weights. The model has therefore never seen position
indices beyond 2047 and cannot generalize to longer sequences without fine-tuning.
Additionally, the model uses alternating sparse attention whose attention mask
grid is built dynamically as `num_blocks = max_position_embeddings // sparse_block_size`, so
increasing `max_position_embeddings` automatically adjusts the sparse grid without any
architectural changes.
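Concretely, the grid size is a pure config computation; a minimal sketch (the function name is ours, not the model's source):

```python
# Minimal sketch: the block-sparse attention grid is derived from config
# values alone, so raising max_position_embeddings enlarges the grid
# without touching the model code.
def sparse_grid_blocks(max_position_embeddings: int, sparse_block_size: int = 16) -> int:
    """Number of blocks per axis of the sparse attention mask grid."""
    return max_position_embeddings // sparse_block_size

print(sparse_grid_blocks(2048))  # 128 blocks at the original context
print(sparse_grid_blocks(8192))  # 512 blocks after extension
```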
### Strategy
Context was extended in two steps: 2k -> 4k, then 4k -> 8k, continuing from the previous checkpoint each time.
**Step 1 - Positional embedding tiling.** The existing embedding matrix is kept intact for known positions (0 to N-1). New positions are filled by cycling through the original table:
```
position 2048 <- weights of position 0
position 2049 <- weights of position 1
...
position 4095 <- weights of position 2047
position 4096 <- weights of position 0 (second cycle)
...
position 8191 <- weights of position 4095
```
This is deliberately chosen over linear interpolation: interpolation perturbs all existing embeddings and causes severe perplexity regression on short contexts. Tiling preserves exact weights for positions 0..N-1, so the model does not "forget" how to handle short sequences.
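The tiling step can be sketched as follows (a simplified reconstruction, not the author's exact script; assumes PyTorch):

```python
import torch
from torch import nn

def tile_position_embeddings(old: nn.Embedding, new_max_positions: int) -> nn.Embedding:
    """Extend a learned absolute position table by cycling the original rows.

    Positions 0..N-1 keep their exact weights; position N reuses row 0,
    position N+1 reuses row 1, and so on, so short-context behavior is
    preserved bit-for-bit.
    """
    old_max, hidden = old.weight.shape
    new = nn.Embedding(new_max_positions, hidden)
    with torch.no_grad():
        idx = torch.arange(new_max_positions) % old_max  # cyclic index map
        new.weight.copy_(old.weight[idx])
    return new
```

Each extension step (2k -> 4k, then 4k -> 8k) applies this to the *current* table, which yields exactly the mapping listed above.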
**Step 2 - Mixed-length dataset.** Training uses a 60/40 mix of long and short examples:
- Long (60%): multiple news articles from IlyaGusev/gazeta packed together with EOS tokens until reaching the target context length. All packed samples exceed half the target length, ensuring the model is consistently exposed to new position indices.
- Short (40%): single-article chunks up to half the target length. Prevents forgetting short-context behavior.
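A rough sketch of this packing scheme (function names, the EOS placeholder, and details are ours, not the release code):

```python
EOS_ID = 0  # placeholder; the real tokenizer defines its own EOS id

def pack_long(tokenized_articles, target_len, eos_id=EOS_ID):
    """Concatenate tokenized articles with EOS separators until target_len,
    then truncate. Packs shorter than half the target are discarded so every
    long sample exercises the new position indices."""
    packed = []
    for tokens in tokenized_articles:
        packed.extend(tokens)
        packed.append(eos_id)
        if len(packed) >= target_len:
            break
    packed = packed[:target_len]
    return packed if len(packed) > target_len // 2 else None

def chunk_short(tokens, target_len):
    """Single-article chunk capped at half the target context length."""
    return tokens[: target_len // 2]
```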
**Step 3 - Continued pretraining.** 3 epochs, `lr=5e-6`, cosine decay, `warmup_steps=50`, `gradient_checkpointing=True`, bfloat16, `gradient_accumulation_steps=8`; hardware: RTX 4090 (48 GB VRAM).
**Note on OOM.** Training at 8k context caused CUDA memory fragmentation during backpropagation, crashing at step 517/936 despite ~1 GB of technically free VRAM. The fix was `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`; after applying it, peak usage dropped from 46.8 GB to 38.5 GB and training completed without issues.
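One way to apply the fix from Python (a general recipe, not the author's script; the setting must be in the environment before the first CUDA allocation):

```python
import os

# Must run before torch initializes CUDA, i.e. before the first tensor
# lands on the GPU (ideally before `import torch`).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
```

Equivalently, export the variable in the shell before launching the training script.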
## Perplexity

Evaluated on the test split of IlyaGusev/gazeta, strategy `non_overlapping`, bfloat16.
| Model | PPL @ 2048 | PPL @ 4096 | PPL @ 8192 |
|---|---|---|---|
| ruGPT3XL (baseline) | 11.68 | - | - |
| ruGPT3XL-4k (intermediate) | 11.75 | 12.04 | - |
| ruGPT3XL-8k (this model) | 11.77 | 11.99 | 13.00 |
Regression on the original 2k context is +0.09 PPL - essentially unchanged. The 4k evaluation on the 8k model is slightly better than the intermediate 4k checkpoint (11.99 vs 12.04), indicating that continued pretraining improved overall quality.
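For reference, the `non_overlapping` strategy scores each token exactly once in disjoint windows, and perplexity is the exponential of the mean per-token negative log-likelihood. A sketch of the bookkeeping (not the exact eval harness):

```python
import math

def non_overlapping_chunks(token_ids, window):
    """Split a token stream into disjoint windows of at most `window` tokens,
    so every token is scored exactly once (no sliding-window overlap)."""
    return [token_ids[i : i + window] for i in range(0, len(token_ids), window)]

def perplexity(total_nll: float, total_tokens: int) -> float:
    """PPL = exp(summed negative log-likelihood / number of scored tokens)."""
    return math.exp(total_nll / total_tokens)
```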
## VRAM Requirements (inference, batch=1, bfloat16)
| Context length | VRAM peak | KV + activations |
|---|---|---|
| 512 | 2.92 GiB | 0.25 GiB |
| 1024 | 3.16 GiB | 0.49 GiB |
| 2048 | 3.86 GiB | 1.19 GiB |
| 4096 | 6.57 GiB | 3.90 GiB |
| 8192 | 15.98 GiB | 13.31 GiB |
Model weights occupy ~2.67 GiB (bfloat16). Overhead from KV cache and activations grows roughly linearly up to ~2k (sparse attention helps) and becomes near-quadratic beyond that. GPUs with 8 GB VRAM are practical up to ~3.5-4k context.
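A back-of-the-envelope estimate of the dense-attention KV-cache portion (our approximation; it ignores activations and the cheaper sparse layers, which is why the measured overhead above is larger):

```python
def kv_cache_gib(seq_len: int, n_layers: int = 24, hidden: int = 2048,
                 bytes_per_elem: int = 2) -> float:
    """K and V tensors of shape [seq_len, hidden] per layer, in bfloat16
    (2 bytes per element)."""
    return 2 * n_layers * seq_len * hidden * bytes_per_elem / 2**30

print(kv_cache_gib(8192))  # 1.5 GiB; the rest of the 13.31 GiB is activations
```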
## Generation Speed (bfloat16, 64 new tokens, batch=1, RTX 4090)
| Prompt length | tok/s | ms / token |
|---|---|---|
| 512 | 1444 | 0.7 |
| 1024 | 882 | 1.1 |
| 2048 | 378 | 2.6 |
| 4096 | 67 | 14.9 |
| 8000 | 38 | 26.6 |
Speed is measured for autoregressive decoding with KV cache. The 2x step from 4k to 8k prompt length causes only ~1.8x slowdown (67 -> 38 tok/s), consistent with the linear scaling expected from sparse attention.
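Throughput of this kind can be measured with a simple wall-clock harness (a sketch; `generate_fn` stands in for a zero-argument call to `model.generate`):

```python
import time

def decode_tokens_per_second(generate_fn, n_new_tokens: int) -> float:
    """Wall-clock decoding throughput for a callable that emits
    n_new_tokens tokens (e.g. a lambda wrapping model.generate)."""
    start = time.perf_counter()
    generate_fn()
    return n_new_tokens / (time.perf_counter() - start)
```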
## Sparse Attention
Inherited from the base model: even-numbered layers (0, 2, 4, ...) use block-sparse causal
attention, odd-numbered layers use standard dense causal attention. The sparse pattern is
computed from `config.json` at model init and does not require DeepSpeed at inference time.
| Parameter | Value |
|---|---|
| `sparse_mode` | `"alternating"` |
| `sparse_block_size` | 16 |
| `sparse_num_local_blocks` | 8 (local window = 128 tokens) |
| `sparse_num_global_blocks` | 1 |
| `sparse_num_different_global_patterns` | 8 |
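The block pattern can be reconstructed as a boolean mask (our approximation of a DeepSpeed-style local + global layout, not the exact kernel; defaults mirror the config above):

```python
import torch

def block_sparse_causal_mask(seq_len: int, block_size: int = 16,
                             num_local_blocks: int = 8,
                             num_global_blocks: int = 1) -> torch.Tensor:
    """Each query block attends causally to itself, the preceding
    num_local_blocks - 1 blocks (a 128-token window with the defaults),
    and the first num_global_blocks block(s)."""
    n = seq_len // block_size
    blocks = torch.zeros(n, n, dtype=torch.bool)
    for q in range(n):
        lo = max(0, q - num_local_blocks + 1)
        blocks[q, lo : q + 1] = True          # local causal window
        blocks[q, :num_global_blocks] = True  # global blocks
    # expand the block grid to token level and apply the causal triangle
    mask = blocks.repeat_interleave(block_size, 0).repeat_interleave(block_size, 1)
    return mask & torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
```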
## Limitations
- Base model, not instruction-tuned. Works best for text completion.
- Primarily Russian text. Limited capability in other languages.
- Content may be biased, factually incorrect, or offensive - inherited from the original pretraining corpus.
- At 8k context, inference requires ~16 GB VRAM (bfloat16, batch=1).
## Training Details
| Parameter | 2k -> 4k step | 4k -> 8k step |
|---|---|---|
| Base | evilfreelancer/ruGPT3XL | ruGPT3XL-4k (intermediate) |
| Dataset | IlyaGusev/gazeta | IlyaGusev/gazeta |
| Train samples | 2500 (1500 long + 1000 short) | 2500 (1500 long + 1000 short) |
| Val samples | 250 | 250 |
| Packed length | 4096 | 8192 |
| Short max length | 2048 | 4096 |
| Epochs | 3 | 3 |
| Learning rate | 5e-6 | 5e-6 |
| LR scheduler | cosine | cosine |
| Warmup steps | 50 | 50 |
| Batch size (effective) | 8 | 8 |
| Optimizer | AdamW fused | AdamW fused |
| Precision | bfloat16 | bfloat16 |
| Hardware | RTX 4090 48 GB | RTX 4090 48 GB |
| Training time | ~2.6 h | ~3.9 h (incl. resume) |
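The hyperparameters in the table map onto Hugging Face `TrainingArguments` roughly as follows (our mapping, not the released training script; effective batch 8 = 1 per device × 8 accumulation steps):

```python
from transformers import TrainingArguments

# Assumed mapping of the table above onto TrainingArguments; output_dir
# and per_device_train_batch_size are our choices, not stated in the card.
args = TrainingArguments(
    output_dir="rugpt3xl-ctx-ext",
    num_train_epochs=3,
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
    warmup_steps=50,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    bf16=True,
    gradient_checkpointing=True,
)
```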
## Citation

```bibtex
@misc{rugpt3xl_8k,
  title={ruGPT-3 XL 8k - extended context window via positional embedding tiling},
  author={Pavel Rykov},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/evilfreelancer/ruGPT3XL-8k}
}
```
## Links
- evilfreelancer/ruGPT3XL - base model
- ai-forever/rugpt3xl - original Megatron-LM checkpoint
- IlyaGusev/gazeta - fine-tuning dataset
- Extending Input Contexts via Segmented Sequences (arXiv:2310.14633)
- Impact of Positional Encoding on Length Generalization (arXiv:2305.19466)