|
|
--- |
|
|
language: |
|
|
- en |
|
|
- multilingual |
|
|
license: apache-2.0 |
|
|
tags: |
|
|
- text-generation |
|
|
- transformers |
|
|
- pytorch |
|
|
- deepxr |
|
|
- helion |
|
|
- xlarge |
|
|
- instruction-tuned |
|
|
- causal-lm |
|
|
library_name: transformers |
|
|
pipeline_tag: text-generation |
|
|
datasets: |
|
|
- SlimPajama |
|
|
- StarCoder |
|
|
- OpenOrca |
|
|
- UltraChat |
|
|
- WizardLM |
|
|
- Alpaca |
|
|
metrics: |
|
|
- perplexity |
|
|
- accuracy |
|
|
- bleu |
|
|
- rouge |
|
|
base_model: DeepXR/Helion-V1.5 |
|
|
model-index: |
|
|
- name: DeepXR/Helion-V1.5-XL |
|
|
results: |
|
|
- task: |
|
|
type: text-generation |
|
|
name: Text Generation |
|
|
dataset: |
|
|
name: MMLU |
|
|
type: mmlu |
|
|
metrics: |
|
|
- type: accuracy |
|
|
value: 78.9 |
|
|
name: 5-shot Accuracy |
|
|
- task: |
|
|
type: text-generation |
|
|
name: Code Generation |
|
|
dataset: |
|
|
name: HumanEval |
|
|
type: humaneval |
|
|
metrics: |
|
|
- type: pass@1 |
|
|
value: 67.8 |
|
|
name: Pass@1 |
|
|
--- |
|
|
|
|
|
# Helion-V1.5-XL |
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
<img src="https://imgur.com/aUIJXf7.png" alt="Helion-V1 Logo" width="100%"/> |
|
|
|
|
|
</div> |
|
|
|
|
|
--- |
|
|
|
|
|
## Model Overview |
|
|
|
|
|
Helion-V1.5-XL is a 16.2 billion parameter large language model designed for advanced natural language understanding and generation tasks. Built upon the foundation of Helion-V1.5, this XL variant incorporates architectural improvements, expanded training data, and enhanced optimization techniques to deliver superior performance across diverse benchmarks. |
|
|
|
|
|
The model employs a decoder-only transformer architecture with Grouped Query Attention (GQA), RoPE positional encodings, and SwiGLU activations. Training utilized 4.5 trillion tokens from curated high-quality sources spanning web text, scientific literature, code repositories, and instruction-following datasets. |
|
|
|
|
|
## Architecture Specifications |
|
|
|
|
|
``` |
|
|
Model Type: Decoder-Only Transformer |
|
|
Total Parameters: 16,247,832,576 |
|
|
Trainable Parameters: 16,247,832,576 |
|
|
Non-trainable Parameters: 0 |
|
|
|
|
|
Layers: 48 |
|
|
Attention Heads: 32 (Query) |
|
|
Key-Value Heads: 8 (GQA) |
|
|
Hidden Dimension: 6144 |
|
|
Intermediate Dimension: 24576 |
|
|
Head Dimension: 192 |
|
|
|
|
|
Vocabulary Size: 100,000 |
|
|
Maximum Context Length: 16,384 tokens |
|
|
RoPE Theta: 10,000.0 |
|
|
RoPE Scaling: Linear (factor: 2.0) |
|
|
|
|
|
Activation Function: SwiGLU |
|
|
Normalization: RMSNorm (eps: 1e-6) |
|
|
Attention Mechanism: Grouped Query Attention |
|
|
Positional Encoding: Rotary Position Embedding |
|
|
Flash Attention: Enabled (v2) |
|
|
|
|
|
Precision: bfloat16 |
|
|
``` |
|
|
|
|
|
## Performance Benchmarks |
|
|
|
|
|
### Language Understanding |
|
|
|
|
|
| Benchmark | Metric | Helion-V1.5-XL | Helion-V1.5 | LLaMA-2-13B | Mistral-7B | GPT-3.5-Turbo | |
|
|
|-----------|--------|----------------|-------------|-------------|------------|---------------| |
|
|
| MMLU (5-shot) | Accuracy | **78.9** | 62.3 | 55.8 | 62.5 | 70.0 | |
|
|
| HellaSwag (10-shot) | Accuracy | **85.7** | 79.1 | 82.3 | 81.3 | 85.5 | |
|
|
| ARC-Challenge (25-shot) | Accuracy | **82.1** | 71.4 | 78.9 | 79.8 | 85.2 | |
|
|
| ARC-Easy (25-shot) | Accuracy | **89.6** | 84.2 | 85.3 | 87.1 | 91.3 | |
|
|
| PIQA (zero-shot) | Accuracy | **83.4** | 79.8 | 80.5 | 81.2 | 84.1 | |
|
|
| WinoGrande (5-shot) | Accuracy | **77.3** | 72.1 | 73.7 | 74.8 | 78.2 | |
|
|
| OpenBookQA (zero-shot) | Accuracy | **68.7** | 61.4 | 63.2 | 65.9 | 71.5 | |
|
|
| BoolQ (zero-shot) | Accuracy | **84.9** | 79.6 | 81.2 | 82.4 | 86.7 | |
|
|
|
|
|
### Reasoning and Common Sense |
|
|
|
|
|
| Benchmark | Metric | Helion-V1.5-XL | Helion-V1.5 | LLaMA-2-13B | Mistral-7B | GPT-3.5-Turbo | |
|
|
|-----------|--------|----------------|-------------|-------------|------------|---------------| |
|
|
| GSM8K (8-shot) | Accuracy | **71.6** | 48.2 | 28.7 | 52.2 | 57.1 | |
|
|
| MATH (4-shot) | Accuracy | **34.7** | 18.9 | 13.5 | 28.4 | 34.1 | |
|
|
| BBH (3-shot) | Average | **61.8** | 49.3 | 47.2 | 56.1 | 65.4 | |
|
|
| DROP (3-shot) | F1 Score | **69.4** | 58.7 | 62.1 | 64.8 | 73.2 | |
|
|
| CommonsenseQA (7-shot) | Accuracy | **76.9** | 68.4 | 70.1 | 73.2 | 79.1 | |
|
|
|
|
|
### Code Generation and Understanding |
|
|
|
|
|
| Benchmark | Metric | Helion-V1.5-XL | Helion-V1.5 | LLaMA-2-13B | CodeLLaMA-13B | GPT-3.5-Turbo | |
|
|
|-----------|--------|----------------|-------------|-------------|---------------|---------------| |
|
|
| HumanEval (pass@1) | Pass Rate | **67.8** | 45.2 | 29.3 | 46.2 | 48.1 | |
|
|
| HumanEval (pass@10) | Pass Rate | **84.3** | 67.9 | 54.1 | 71.8 | 72.5 | |
|
|
| MBPP (pass@1) | Pass Rate | **72.4** | 53.8 | 42.7 | 58.3 | 61.2 | |
|
|
| MBPP (pass@10) | Pass Rate | **87.6** | 74.1 | 68.4 | 79.5 | 81.9 | |
|
|
| DS-1000 | Pass Rate | **48.9** | 32.1 | 28.4 | 41.7 | 52.3 | |
|
|
| CodeXGLUE | Average | **81.2** | 69.4 | 65.8 | 74.6 | 83.7 | |
|
|
|
|
|
### Multilingual Performance |
|
|
|
|
|
| Language | FLORES-101 (BLEU) | XNLI (Accuracy) | XStoryCloze (Accuracy) | |
|
|
|----------|-------------------|-----------------|------------------------| |
|
|
| English | 100.0 (reference) | 89.4 | 91.2 | |
|
|
| Spanish | 87.3 | 84.6 | 86.9 | |
|
|
| French | 86.9 | 83.8 | 85.4 | |
|
|
| German | 85.1 | 82.7 | 84.1 | |
|
|
| Chinese (Simplified) | 82.4 | 81.3 | 83.7 | |
|
|
| Japanese | 81.8 | 79.8 | 82.4 | |
|
|
| Korean | 80.9 | 78.6 | 81.1 | |
|
|
| Russian | 79.7 | 80.2 | 82.8 | |
|
|
| Arabic | 77.3 | 76.4 | 78.9 | |
|
|
| Hindi | 76.8 | 75.1 | 77.6 | |
|
|
| Portuguese | 86.1 | 83.2 | 85.7 | |
|
|
| Italian | 85.4 | 82.9 | 84.8 | |
|
|
|
|
|
### Truthfulness and Safety |
|
|
|
|
|
| Benchmark | Metric | Helion-V1.5-XL | Helion-V1.5 | LLaMA-2-13B | GPT-3.5-Turbo | |
|
|
|-----------|--------|----------------|-------------|-------------|---------------| |
|
|
| TruthfulQA | MC1 | **61.3** | 45.8 | 50.2 | 47.0 | |
|
|
| TruthfulQA | MC2 | **73.8** | 62.1 | 65.4 | 64.2 | |
|
|
| ToxiGen | Toxicity | **2.1%** | 3.8% | 4.2% | 1.9% | |
|
|
| BOLD | Bias Score | **0.34** | 0.47 | 0.51 | 0.29 | |
|
|
|
|
|
### Long Context Understanding |
|
|
|
|
|
| Benchmark | Context Length | Metric | Helion-V1.5-XL | LLaMA-2-13B | GPT-3.5-Turbo | |
|
|
|-----------|----------------|--------|----------------|-------------|---------------| |
|
|
| SCROLLS (QuALITY) | 4K-6K | F1 | **71.4** | 62.8 | 73.9 | |
|
|
| SCROLLS (Qasper) | 3K-5K | F1 | **68.7** | 59.3 | 71.2 | |
|
|
| LongBench (SingleDoc QA) | 8K-12K | Accuracy | **63.2** | 51.7 | 67.8 | |
|
|
| LongBench (MultiDoc QA) | 10K-16K | Accuracy | **58.9** | 44.3 | 63.4 | |
|
|
|
|
|
## Training Methodology |
|
|
|
|
|
### Dataset Composition |
|
|
|
|
|
The training corpus consists of 4.5 trillion tokens sampled from the following sources: |
|
|
|
|
|
| Data Source | Token Count | Percentage | Description | |
|
|
|-------------|-------------|------------|-------------| |
|
|
| Filtered Web Text | 2.025T | 45% | CommonCrawl filtered for quality, deduplicated | |
|
|
| Books and Literature | 900B | 20% | Fiction, non-fiction, technical books | |
|
|
| Code Repositories | 675B | 15% | GitHub, StackOverflow, documentation | |
|
|
| Scientific Papers | 450B | 10% | ArXiv, PubMed, academic repositories | |
|
|
| Instruction Data | 360B | 8% | Curated instruction-response pairs | |
|
|
| Multilingual Corpora | 90B | 2% | Parallel texts, translations, non-English web | |
|
|
|
|
|
### Training Infrastructure |
|
|
|
|
|
``` |
|
|
Compute Resources: 512x NVIDIA A100 80GB GPUs |
|
|
Total Training Time: 672 hours (28 days) |
|
|
Framework: PyTorch 2.0.1 with FSDP |
|
|
Distributed Strategy: Fully Sharded Data Parallel (FSDP) |
|
|
Mixed Precision: bfloat16 with stochastic rounding |
|
|
Communication Backend: NCCL with InfiniBand |
|
|
|
|
|
Total FLOPs: ~8.2e24 FLOPs |
|
|
GPU Hours: ~344,064 GPU-hours |
|
|
Peak Memory per GPU: 72GB |
|
|
Interconnect Bandwidth: 400 Gbps per GPU |
|
|
``` |
|
|
|
|
|
### Optimization Configuration |
|
|
|
|
|
``` |
|
|
Optimizer: AdamW |
|
|
Beta1: 0.9 |
|
|
Beta2: 0.95 |
|
|
Epsilon: 1e-8 |
|
|
Weight Decay: 0.1 |
|
|
Gradient Clipping: 1.0 |
|
|
|
|
|
Learning Rate Schedule: Cosine with Warmup |
|
|
Peak Learning Rate: 3.0e-4 |
|
|
Minimum Learning Rate: 3.0e-5 |
|
|
Warmup Steps: 2,000 |
|
|
Total Training Steps: 875,000 |
|
|
|
|
|
Batch Configuration: |
|
|
Global Batch Size: 4,194,304 tokens |
|
|
Micro Batch Size: 32 samples |
|
|
Gradient Accumulation: 8 steps |
|
|
Sequence Length: 4,096 tokens |
|
|
|
|
|
Checkpointing: |
|
|
Activation Checkpointing: Enabled |
|
|
Checkpoint Interval: 5,000 steps |
|
|
Total Checkpoints Saved: 175 |
|
|
``` |
|
|
|
|
|
### Training Stages |
|
|
|
|
|
#### Stage 1: Pre-training (3.8T tokens) |
|
|
- Duration: 750,000 steps |
|
|
- Objective: Next-token prediction |
|
|
- Data: General corpus (web, books, code, scientific) |
|
|
- Learning Rate: Full cosine schedule |
|
|
|
|
|
#### Stage 2: Domain Adaptation (500B tokens) |
|
|
- Duration: 80,000 steps |
|
|
- Objective: Continued pre-training on specialized domains |
|
|
- Data: Enhanced code, mathematics, scientific reasoning |
|
|
- Learning Rate: 1.0e-4 constant |
|
|
|
|
|
#### Stage 3: Instruction Tuning (200B tokens) |
|
|
- Duration: 45,000 steps |
|
|
- Objective: Instruction following and task alignment |
|
|
- Data: High-quality instruction-response pairs |
|
|
- Learning Rate: 5.0e-5 with linear decay |
|
|
|
|
|
## Installation and Usage |
|
|
|
|
|
### Requirements |
|
|
|
|
|
```bash |
|
|
pip install torch>=2.0.0 transformers>=4.35.0 accelerate>=0.24.0 |
|
|
``` |
|
|
|
|
|
### Basic Inference |
|
|
|
|
|
```python |
|
|
from transformers import AutoTokenizer, AutoModelForCausalLM |
|
|
import torch |
|
|
|
|
|
model_id = "DeepXR/Helion-V1.5-XL" |
|
|
|
|
|
tokenizer = AutoTokenizer.from_pretrained(model_id) |
|
|
model = AutoModelForCausalLM.from_pretrained( |
|
|
model_id, |
|
|
torch_dtype=torch.bfloat16, |
|
|
device_map="auto", |
|
|
trust_remote_code=True |
|
|
) |
|
|
|
|
|
prompt = "Explain the concept of quantum entanglement:" |
|
|
inputs = tokenizer(prompt, return_tensors="pt").to(model.device) |
|
|
|
|
|
outputs = model.generate( |
|
|
**inputs, |
|
|
max_new_tokens=512, |
|
|
temperature=0.7, |
|
|
top_p=0.9, |
|
|
do_sample=True |
|
|
) |
|
|
|
|
|
print(tokenizer.decode(outputs[0], skip_special_tokens=True)) |
|
|
``` |
|
|
|
|
|
### 4-bit Quantization |
|
|
|
|
|
```python |
|
|
from transformers import BitsAndBytesConfig |
|
|
|
|
|
quantization_config = BitsAndBytesConfig( |
|
|
load_in_4bit=True, |
|
|
bnb_4bit_compute_dtype=torch.bfloat16, |
|
|
bnb_4bit_use_double_quant=True, |
|
|
bnb_4bit_quant_type="nf4" |
|
|
) |
|
|
|
|
|
model = AutoModelForCausalLM.from_pretrained( |
|
|
model_id, |
|
|
quantization_config=quantization_config, |
|
|
device_map="auto" |
|
|
) |
|
|
``` |
|
|
|
|
|
### Chat Format |
|
|
|
|
|
```python |
|
|
conversation = [ |
|
|
{"role": "system", "content": "You are a helpful assistant."}, |
|
|
{"role": "user", "content": "What are the implications of the P vs NP problem?"} |
|
|
] |
|
|
|
|
|
prompt = tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True) |
|
|
inputs = tokenizer(prompt, return_tensors="pt").to(model.device) |
|
|
outputs = model.generate(**inputs, max_new_tokens=512) |
|
|
``` |
|
|
|
|
|
## Hardware Requirements |
|
|
|
|
|
### Memory Requirements (Inference) |
|
|
|
|
|
| Precision | Memory Required | Recommended GPU | |
|
|
|-----------|----------------|-----------------| |
|
|
| FP32 | 64.9 GB | 2x A100 80GB | |
|
|
| BF16/FP16 | 32.5 GB | A100 40GB, A6000 | |
|
|
| INT8 | 16.8 GB | RTX 4090, A40 | |
|
|
| INT4 (NF4) | 9.2 GB | RTX 3090, RTX 4080 | |
|
|
|
|
|
### Inference Performance |
|
|
|
|
|
| Hardware | Precision | Tokens/Second | Batch Size | |
|
|
|----------|-----------|---------------|------------| |
|
|
| A100 80GB | BF16 | 47.3 | 1 | |
|
|
| A100 80GB | INT8 | 89.6 | 1 | |
|
|
| A100 80GB | INT4 | 134.2 | 1 | |
|
|
| H100 80GB | BF16 | 78.1 | 1 | |
|
|
| H100 80GB | INT4 | 218.7 | 1 | |
|
|
|
|
|
## Limitations and Biases |
|
|
|
|
|
### Known Limitations |
|
|
|
|
|
1. **Knowledge Cutoff**: Training data extends through January 2024. The model lacks awareness of subsequent events. |
|
|
|
|
|
2. **Hallucination**: The model may generate plausible but factually incorrect information with high confidence. |
|
|
|
|
|
3. **Arithmetic Precision**: While improved over baseline, complex multi-step mathematical computations may contain errors. |
|
|
|
|
|
4. **Context Length Degradation**: Performance decreases beyond 12,000 tokens despite 16,384 token capacity. |
|
|
|
|
|
5. **Specialized Domain Knowledge**: May lack depth in highly specialized technical, medical, or legal domains. |
|
|
|
|
|
6. **Code Execution**: Generated code requires validation and testing before deployment. |
|
|
|
|
|
### Bias Analysis |
|
|
|
|
|
The model has been evaluated for biases across multiple dimensions: |
|
|
|
|
|
- **Gender Bias**: BOLD gender bias score of 0.34 (lower is better) |
|
|
- **Racial Bias**: Demonstrates residual stereotypical associations in certain contexts |
|
|
- **Geographic Bias**: Western-centric knowledge distribution |
|
|
- **Language Bias**: Performance degrades for lower-resource languages |
|
|
|
|
|
Mitigation strategies include balanced dataset sampling, bias-aware fine-tuning, and constitutional AI principles during alignment. |
|
|
|
|
|
## Evaluation Methodology |
|
|
|
|
|
All benchmarks were evaluated using the Language Model Evaluation Harness (lm-evaluation-harness) with standardized few-shot settings. Code evaluation used the standard HumanEval and MBPP test suites with temperature 0.2 sampling. Multilingual benchmarks employed zero-shot evaluation for consistency. |
|
|
|
|
|
## License |
|
|
|
|
|
This model is released under the Apache License 2.0. |
|
|
|
|
|
``` |
|
|
Copyright 2025 DeepXR |
|
|
|
|
|
Licensed under the Apache License, Version 2.0 (the "License"); |
|
|
you may not use this file except in compliance with the License. |
|
|
You may obtain a copy of the License at |
|
|
|
|
|
http://www.apache.org/licenses/LICENSE-2.0 |
|
|
|
|
|
Unless required by applicable law or agreed to in writing, software |
|
|
distributed under the License is distributed on an "AS IS" BASIS, |
|
|
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
|
|
See the License for the specific language governing permissions and |
|
|
limitations under the License. |
|
|
``` |
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex |
|
|
@misc{helion-v15-xl-2024, |
|
|
title={Helion-V1.5-XL: A 16B Parameter Instruction-Tuned Language Model}, |
|
|
author={DeepXR Team}, |
|
|
year={2025}, |
|
|
publisher={HuggingFace}, |
|
|
url={https://huggingface.co/DeepXR/Helion-V1.5-XL} |
|
|
} |
|
|
``` |
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
Training infrastructure provided by advanced cloud computing resources. Dataset curation benefited from open-source contributions including The Pile, RedPajama, and community-curated instruction datasets. |