---
license: apache-2.0
language:
- ru
- en
- multilingual
tags:
- mistral
- russian
- english
- code
- machine-learning
- nlp
- transformer
- gqa
- rmsnorm
- swiglu
- rope
- flash-attention-2
- dark-ultima
- 5tb
- ultra-large
- experimental
- sharded
pipeline_tag: text-generation
size_categories: 5TB
---
# RadonDarkUltima (5TB) - Ultra-Large Scale Model

## Model Description
RadonDarkUltima is an experimental ultra-large-scale Mistral-based transformer weighing in at **~5 TB** in FP16 (2.5 trillion parameters), designed for cutting-edge research and development. This model represents the pinnacle of the RADON ecosystem, pushing the boundaries of what is possible with open-source language models.

### ⚠️ **EXPERIMENTAL MODEL - RESEARCH USE ONLY**
This model is at an experimental stage and requires massive computational resources. This repository provides the framework only; the actual weights will be uploaded separately.
## Key Features
- **Parameters**: **2.5T** (2,500,000,000,000), ~5 TB in FP16
- **Architecture**: Mistral with Llama 3 innovations (GQA, RMSNorm, SwiGLU, RoPE)
- **Context Length**: **32,768 tokens** (32K)
- **Languages**: Russian, English, Code, Multilingual
- **Sharding**: 100 shards of ~50 GB each
- **Quantization**: FP16 + INT8 hybrid for memory efficiency (see the sketch after this list)
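The card does not pin down the exact hybrid scheme, but LLM.int8()-style quantization matches the description: most weights stored in INT8 with outlier features kept in FP16. A minimal sketch, assuming bitsandbytes is installed and the weights are available:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# One plausible FP16 + INT8 hybrid via bitsandbytes' LLM.int8():
# weights are stored in INT8 while outlier features stay in FP16.
# The exact scheme used by RadonDarkUltima is not specified, so this
# configuration is an assumption, not the official recipe.
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,  # features above this magnitude stay in FP16
)

model = AutoModelForCausalLM.from_pretrained(
    "MagistrTheOne/RadonDarkUltima",
    quantization_config=bnb_config,
    device_map="auto",
)
```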
## Technical Specifications
- **Hidden Size**: 16,384
- **Layers**: 200
- **Attention Heads**: 128
- **KV Heads**: 16 (GQA ratio 8:1)
- **Intermediate Size**: 65,536
- **Vocabulary**: 256,000 tokens
- **Memory**: ~5 TB (FP16)
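For orientation, these dimensions map onto a standard Hugging Face `MistralConfig` roughly as follows. The authoritative configuration ships with the weights, so treat this as a restatement of the list above rather than the official file:

```python
from transformers import MistralConfig

# Hypothetical config mirroring the specifications above; the official
# config.json will be published together with the model weights.
config = MistralConfig(
    vocab_size=256_000,
    hidden_size=16_384,
    num_hidden_layers=200,
    num_attention_heads=128,
    num_key_value_heads=16,       # GQA: 128 query heads / 16 KV heads = 8:1
    intermediate_size=65_536,     # SwiGLU feed-forward width
    max_position_embeddings=32_768,
)
print(config)
```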
## Hardware Requirements

### Minimum Requirements

- **GPU**: 5 TB+ total VRAM (e.g. 64+ A100 80GB or 64+ H100 80GB)
- **RAM**: 10 TB+ system memory
- **Storage**: 15 TB+ NVMe SSD
- **Network**: High-speed interconnect for shard loading

### Recommended Setup

- **GPU**: 10 TB+ total VRAM (e.g. 128+ H100 80GB or equivalent)
- **RAM**: 20 TB+ system memory
- **Storage**: 20 TB+ NVMe SSD
- **Infrastructure**: Data center with high-speed networking
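A quick sanity check on where these numbers come from, assuming 80 GB accelerators (illustrative arithmetic only):

```python
# Back-of-the-envelope sizing behind the requirements above.
params = 2.5e12                  # 2.5T parameters
weights_tb = params * 2 / 1e12   # FP16 = 2 bytes per parameter
print(f"Weights (FP16): {weights_tb:.1f} TB")               # ~5.0 TB

gpus_needed = weights_tb * 1e12 / 80e9  # 80 GB cards, weights only
print(f"80 GB GPUs for weights alone: {gpus_needed:.0f}")   # ~63
```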
## Sharding Strategy

The model is split into 100 shards for efficient loading:

- **Shard 1**: Embeddings (256,000 x 16,384)
- **Shards 2-99**: Transformer layers (200 layers distributed)
- **Shard 100**: Final layer norm + LM head

Each shard is approximately 50 GB in size.
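In the standard Hugging Face sharded-checkpoint layout, this strategy would be described by a `model.safetensors.index.json` file mapping each tensor to its shard. The file and tensor names below are illustrative assumptions; the real index ships with the weights:

```python
# Illustrative shard index following the usual safetensors index layout;
# every file and tensor name here is an assumption, not the shipped index.
shard_index = {
    "metadata": {"total_size": 5 * 10**12},  # ~5 TB
    "weight_map": {
        "model.embed_tokens.weight": "model-00001-of-00100.safetensors",
        "model.layers.0.self_attn.q_proj.weight": "model-00002-of-00100.safetensors",
        # ... layers 0-199 distributed across shards 2-99 ...
        "model.norm.weight": "model-00100-of-00100.safetensors",
        "lm_head.weight": "model-00100-of-00100.safetensors",
    },
}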
## Usage (Framework Only)

⚠️ **Note**: This repository contains only the model framework. Actual weights will be uploaded separately.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the model framework (weights not included yet)
model = AutoModelForCausalLM.from_pretrained(
    "MagistrTheOne/RadonDarkUltima",
    torch_dtype=torch.float16,
    device_map="auto",        # spread shards across all available devices
    low_cpu_mem_usage=True,   # avoid materializing full weights in CPU RAM
)

tokenizer = AutoTokenizer.from_pretrained("MagistrTheOne/RadonDarkUltima")

# Generate text (requires the actual weights)
prompt = "Привет! Как дела?"  # Russian: "Hi! How are you?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,    # temperature only applies when sampling is enabled
    temperature=0.7,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
## Model Architecture

```
RadonDarkUltima (~5 TB, 2.5T parameters)
├── Mistral Base Architecture
├── Llama 3 Innovations
│   ├── Grouped Query Attention (GQA) - 8:1 ratio
│   ├── RMSNorm Layer Normalization
│   ├── SwiGLU Activation
│   └── Rotary Position Embeddings (RoPE)
├── Flash Attention 2
├── Gradient Checkpointing
├── Sharded Weights (100 shards)
├── FP16 + INT8 Hybrid Quantization
└── Ultra-Large Scale Optimization
```
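To see why the 8:1 GQA ratio matters at this scale, here is a rough estimate of the per-sequence KV-cache size implied by the specifications above (illustrative arithmetic only):

```python
# KV-cache size for one 32K-token sequence in FP16, from the specs above.
hidden_size, n_heads, n_kv_heads, n_layers = 16_384, 128, 16, 200
head_dim = hidden_size // n_heads  # 128
seq_len, dtype_bytes = 32_768, 2   # FP16 = 2 bytes per element

def kv_cache_gib(kv_heads: int) -> float:
    # 2x for keys and values, accumulated across all layers
    return 2 * n_layers * kv_heads * head_dim * seq_len * dtype_bytes / 1024**3

print(f"Full MHA (128 KV heads): {kv_cache_gib(n_heads):.0f} GiB")     # ~400 GiB
print(f"GQA      (16 KV heads):  {kv_cache_gib(n_kv_heads):.0f} GiB")  # ~50 GiB
```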
## Performance Expectations

This experimental model is designed for:

- **Ultra-long context processing** (32K+ tokens)
- **Advanced reasoning** and problem-solving
- **Multilingual understanding** (Russian, English, Code)
- **Research applications** requiring massive scale
- **Benchmarking** against the largest commercial models
## Limitations

- **Experimental**: Not production-ready
- **Massive resources**: Requires data-center infrastructure
- **Weights pending**: Framework only; weights will be uploaded separately
- **Research use**: Intended for research and development only
- **High cost**: Significant computational requirements
## Creator

**MagistrTheOne** - Creator and lead developer of RADON

- Specializes in ultra-large-scale AI models
- Focuses on Russian-English machine-learning applications
- Open-source AI advocate and researcher
- Creator of the RADON ecosystem
## Contact

- GitHub: [MagistrTheOne/Radon2BMistral](https://github.com/MagistrTheOne/Radon2BMistral)
- Hugging Face: [MagistrTheOne/RadonDarkUltima](https://huggingface.co/MagistrTheOne/RadonDarkUltima)
- Creator: [MagistrTheOne](https://github.com/MagistrTheOne)
## License

Apache 2.0 License
## Citation

```bibtex
@misc{radon-dark-ultima-2024,
  title={RadonDarkUltima: 5TB Parameter Ultra-Large Scale Mistral-based Transformer},
  author={MagistrTheOne},
  year={2024},
  url={https://huggingface.co/MagistrTheOne/RadonDarkUltima}
}
```
---

**Created with ❤️ by MagistrTheOne**

**Pushing the boundaries of open-source AI! 🚀**

## Warning

This is an experimental research model requiring massive computational resources. Use responsibly and only for research purposes.