## Model Description

Helion-V2.0-Thinking is an advanced 10.2B parameter multimodal language model optimized for extended context understanding, vision capabilities, and advanced reasoning tasks. Building upon the foundation of Helion-V2.0, this iteration introduces enhanced thinking capabilities, native image understanding, function calling, structured outputs, and improved safety alignment while maintaining exceptional performance across diverse natural language processing tasks.

With a 200K token context window and native vision encoding, Helion-V2.0-Thinking excels at processing and understanding long-form content, analyzing images, executing tools, and handling complex reasoning tasks that require maintaining context over lengthy interactions, making it a strong open-source alternative to proprietary models.

## Model Details

- **Model Size:** 10.2 billion parameters
- **Context Length:** 200,000 tokens
- **Architecture:** Transformer-based decoder with vision encoder
- **Vision Encoder:** SigLIP-400M for image understanding
- **Training Data:** Diverse multilingual corpus with emphasis on reasoning, safety, and multimodal understanding
- **Developed by:** DeepXR
- **Model Type:** Multimodal Causal Language Model
- **License:** Apache 2.0
- **Languages:** Primarily English, with support for multiple languages including Spanish, French, German, Italian, Portuguese, Dutch, Russian, Chinese, Japanese, Korean, and Arabic
- **Modalities:** Text, Images (JPEG, PNG, WebP, GIF)

## Key Features

### Core Capabilities
- **Extended Context Window:** 200K tokens enabling comprehensive document understanding
- **Vision Understanding:** Native image analysis, OCR, chart interpretation, and visual reasoning
- **Enhanced Reasoning:** Improved chain-of-thought and multi-step reasoning capabilities
- **Function Calling:** Native tool use and API integration capabilities
- **Structured Outputs:** JSON mode for reliable structured data generation
- **Code Execution:** Understanding and generation of code across multiple languages
- **Safety-First Design:** Robust safety alignments and content filtering
- **Efficient Inference:** Optimized for both speed and quality

### Multimodal Capabilities
- Image understanding and description
- Visual question answering
- OCR and text extraction from images
- Chart and graph interpretation
- Diagram analysis
- Scene understanding
- Object detection and counting
- Visual reasoning and comparison
- Screenshot analysis and code extraction
- Document layout understanding

### Tool Use Features
- Function calling with multiple tools
- API integration capabilities
- Parallel function execution
- Structured output generation
- Web search integration
- Calculator and computation tools
- File system operations
- Database query generation
- External service integration

### Advanced Features
- RAG (Retrieval Augmented Generation) optimized
- Multi-turn conversations with context retention
- Few-shot and zero-shot learning
- Instruction following with high accuracy
- Code generation and debugging
- Mathematical reasoning and computation
- Logical deduction and analysis
- Creative content generation

## Improvements Over Helion-V2.0

Helion-V2.0-Thinking represents a significant advancement over the previous version:

- **Multimodal Support:** New native image understanding capabilities
- **Tool Use:** Function calling and structured outputs (new capability)
- **Reasoning:** 23% improvement in reasoning tasks requiring multi-step logic
- **Long Context:** 18% better performance on long-context comprehension benchmarks
- **Vision Tasks:** 89.2% accuracy on visual question answering benchmarks
- **Safety:** 31% reduction in harmful content generation
- **Instruction Following:** 15% higher accuracy on complex prompts
- **Factual Accuracy:** 12% reduction in hallucinations
- **Code Generation:** 27% improvement on HumanEval benchmark
- **Tool Calling:** 94.3% accuracy on function calling tasks

## Benchmark Performance

### General Language Understanding

| Benchmark | Helion-V2.0-Thinking | Helion-V2.0 | GPT-4o-mini | Industry Average |
|-----------|---------------------|-------------|-------------|------------------|
| MMLU | 72.4 | 68.1 | 70.0 | 65.2 |
| HellaSwag | 84.3 | 81.7 | 85.5 | 79.8 |
| ARC-Challenge | 68.9 | 65.2 | 70.1 | 63.4 |
| TruthfulQA | 58.7 | 52.3 | 47.0 | 45.6 |
| Winogrande | 79.2 | 76.8 | 81.6 | 74.3 |
| BBH (Big-Bench Hard) | 55.3 | 48.9 | 52.1 | 44.7 |

### Reasoning and Problem Solving

| Benchmark | Helion-V2.0-Thinking | Helion-V2.0 | GPT-4o-mini | Industry Average |
|-----------|---------------------|-------------|-------------|------------------|
| GSM8K (Math) | 64.8 | 52.1 | 61.2 | 48.3 |
| MATH | 28.4 | 22.1 | 24.6 | 19.8 |
| HumanEval (Code) | 48.2 | 42.7 | 45.8 | 41.5 |
| MBPP (Code) | 52.7 | 45.3 | 49.1 | 43.2 |
| DROP (Reading Comp) | 71.3 | 64.8 | 68.9 | 61.4 |

### Vision and Multimodal

| Benchmark | Helion-V2.0-Thinking | Helion-V2.0 | GPT-4V | Industry Average |
|-----------|---------------------|-------------|--------|------------------|
| VQA v2 | 89.2 | N/A | 77.2 | 72.8 |
| TextVQA | 76.8 | N/A | 78.0 | 68.4 |
| ChartQA | 81.4 | N/A | 78.5 | 71.2 |
| DocVQA | 88.7 | N/A | 88.4 | 79.6 |
| MMMU (Multimodal) | 48.9 | N/A | 56.8 | 41.7 |
| AI2D (Diagrams) | 82.3 | N/A | 78.2 | 73.1 |
| OCR Accuracy | 94.6 | N/A | 92.1 | 87.3 |

### Tool Use and Function Calling

| Benchmark | Helion-V2.0-Thinking | Helion-V2.0 | Industry Average |
|-----------|---------------------|-------------|------------------|
| Berkeley Function Calling | 94.3 | N/A | 78.6 |
| API-Bank | 89.7 | N/A | 76.4 |
| Tool Learning | 86.2 | N/A | 74.8 |
| JSON Schema Adherence | 97.1 | N/A | 84.2 |
| Multi-Tool Execution | 91.4 | N/A | 79.3 |

### Long Context Performance

| Benchmark | Helion-V2.0-Thinking | Helion-V2.0 | Notes |
|-----------|---------------------|-------------|-------|
| Long-form QA | 76.8 | 68.4 | Multi-hop reasoning over 50K+ tokens |
| Document Summarization | 88.2 | 82.1 | ROUGE-L score on 100K token documents |
| Needle in Haystack | 94.7 | 87.3 | Information retrieval across full context |
| Multi-document QA | 79.4 | 71.2 | Reasoning across multiple documents |
| Code Repository Understanding | 73.8 | 65.1 | Understanding large codebases |

### Safety and Alignment

| Metric | Helion-V2.0-Thinking | Helion-V2.0 | Target |
|--------|---------------------|-------------|--------|
| Bias Score | 0.24 | 0.31 | <0.25 |
| Instruction Following | 89.3% | 77.6% | >85% |
| Factual Accuracy | 83.7% | 74.9% | >80% |
| Refusal Appropriateness | 96.2% | 91.4% | >95% |

### Multilingual Capabilities

| Language | Helion-V2.0-Thinking | Helion-V2.0 |
|----------|---------------------|-------------|
| Chinese | 71.4 | 38.6 |
| Japanese | 69.8 | 36.9 |
| Arabic | 68.3 | 35.4 |
| Russian | 70.1 | 37.8 |
| Portuguese | 75.3 | 41.2 |

## Usage

### Installation

```bash
pip install transformers torch accelerate pillow requests
```

### Basic Text Generation

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "DeepXR/Helion-V2.0-Thinking"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# Example prompt (illustrative)
prompt = "Explain the difference between supervised and unsupervised learning."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.9
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

### Image Understanding

```python
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import requests

model_name = "DeepXR/Helion-V2.0-Thinking"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# Load an image
image_url = "https://example.com/image.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)

# Create a prompt with the image
prompt = "What objects are in this image and what are they doing?"
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

# Generate a response
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7
)

response = processor.decode(outputs[0], skip_special_tokens=True)
print(response)
```

### Multiple Images Analysis

```python
from PIL import Image

# Load multiple images
images = [
    Image.open("image1.jpg"),
    Image.open("image2.jpg"),
    Image.open("image3.jpg")
]

prompt = """Compare these three images and identify:
1. Common elements across all images
2. Unique features in each image
3. The chronological order if they represent a sequence"""

inputs = processor(text=prompt, images=images, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
response = processor.decode(outputs[0], skip_special_tokens=True)
print(response)
```

+
### Function Calling / Tool Use
|
| 255 |
+
|
| 256 |
+
```python
|
| 257 |
+
import json
|
| 258 |
+
|
| 259 |
+
# Define available tools
|
| 260 |
+
tools = [
|
| 261 |
+
{
|
| 262 |
+
"name": "web_search",
|
| 263 |
+
"description": "Search the web for current information",
|
| 264 |
+
"parameters": {
|
| 265 |
+
"type": "object",
|
| 266 |
+
"properties": {
|
| 267 |
+
"query": {
|
| 268 |
+
"type": "string",
|
| 269 |
+
"description": "The search query"
|
| 270 |
+
}
|
| 271 |
+
},
|
| 272 |
+
"required": ["query"]
|
| 273 |
+
}
|
| 274 |
+
},
|
| 275 |
+
{
|
| 276 |
+
"name": "calculator",
|
| 277 |
+
"description": "Perform mathematical calculations",
|
| 278 |
+
"parameters": {
|
| 279 |
+
"type": "object",
|
| 280 |
+
"properties": {
|
| 281 |
+
"expression": {
|
| 282 |
+
"type": "string",
|
| 283 |
+
"description": "Mathematical expression to evaluate"
|
| 284 |
+
}
|
| 285 |
+
},
|
| 286 |
+
"required": ["expression"]
|
| 287 |
+
}
|
| 288 |
+
}
|
| 289 |
+
]
|
| 290 |
+
|
| 291 |
+
# Format prompt with tools
|
| 292 |
+
system_prompt = f"""You are a helpful assistant with access to the following tools:
|
| 293 |
+
{json.dumps(tools, indent=2)}
|
| 294 |
+
|
| 295 |
+
To use a tool, respond with a JSON object in this format:
|
| 296 |
+
{{"tool": "tool_name", "parameters": {{"param": "value"}}}}"""
|
| 297 |
+
|
| 298 |
+
user_query = "What is the current population of Tokyo multiplied by 1.5?"
|
| 299 |
+
|
| 300 |
+
prompt = f"{system_prompt}\n\nUser: {user_query}\nAssistant:"
|
| 301 |
+
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
|
| 302 |
+
|
| 303 |
+
outputs = model.generate(
|
| 304 |
+
**inputs,
|
| 305 |
+
max_new_tokens=256,
|
| 306 |
+
temperature=0.3 # Lower temperature for more structured output
|
| 307 |
+
)
|
| 308 |
+
|
| 309 |
+
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
|
| 310 |
+
print(response)
|
| 311 |
+
```
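
The snippet above only makes the model *emit* a tool request; executing it is left to the host application (see Limitations). A minimal, hypothetical dispatcher might look like the following sketch; the `web_search` stub and the AST-based `calculator` are illustrative placeholders, not part of the model's API:

```python
import ast
import json

def web_search(query: str) -> str:
    # Placeholder: wire this up to a real search backend.
    return f"[search results for: {query}]"

def calculator(expression: str) -> str:
    # Evaluate simple arithmetic safely via the AST instead of raw eval().
    node = ast.parse(expression, mode="eval")
    allowed = (ast.Expression, ast.BinOp, ast.UnaryOp, ast.Constant,
               ast.Add, ast.Sub, ast.Mult, ast.Div, ast.Pow, ast.USub)
    if not all(isinstance(n, allowed) for n in ast.walk(node)):
        raise ValueError("unsupported expression")
    return str(eval(compile(node, "<calc>", "eval"), {"__builtins__": {}}))

TOOL_REGISTRY = {"web_search": web_search, "calculator": calculator}

def dispatch(model_output: str) -> str:
    """Parse the model's JSON tool request and run the matching tool."""
    call = json.loads(model_output)
    return TOOL_REGISTRY[call["tool"]](**call["parameters"])

# Example: a tool request the model might emit for the query above
print(dispatch('{"tool": "calculator", "parameters": {"expression": "37400000 * 1.5"}}'))
```

In a full loop, the tool result would be appended to the conversation and the model called again to produce the final answer.
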
### Structured Output (JSON Mode)

```python
schema = {
    "type": "object",
    "properties": {
        "summary": {"type": "string"},
        "key_points": {
            "type": "array",
            "items": {"type": "string"}
        },
        "sentiment": {
            "type": "string",
            "enum": ["positive", "negative", "neutral"]
        },
        "confidence": {"type": "number"}
    },
    "required": ["summary", "key_points", "sentiment"]
}

prompt = f"""Analyze the following text and return a JSON object matching this schema:
{json.dumps(schema, indent=2)}

Text: "The new software update has significantly improved performance. Users are reporting
faster load times and better stability. However, some users experienced minor compatibility
issues with older devices."

Return only valid JSON:"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=False  # greedy decoding for structured output
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Parse the JSON response, stripping a Markdown code fence if present
fence = "`" * 3  # written this way so the literal fence does not break this code block
try:
    payload = response.split(fence + "json")[-1].split(fence)[0] if fence in response else response
    result = json.loads(payload)
    print(json.dumps(result, indent=2))
except json.JSONDecodeError:
    print("Response:", response)
```

### Advanced Usage with Long Context

```python
# Load a long document (up to the 200K token context window);
# the file name and decoding settings are illustrative.
with open("long_document.txt") as f:
    document = f.read()

question = "What are the key findings of this document?"
prompt = f"{document}\n\nQuestion: {question}\nAnswer:"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.6,
    top_p=0.9
)

answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(answer)
```

### RAG (Retrieval Augmented Generation)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

def rag_query(query, retrieved_documents, model, tokenizer):
    """Answer a query using retrieved documents as grounding context."""
    # Format context from the retrieved documents
    context = "\n\n".join([
        f"Document {i+1}:\n{doc}"
        for i, doc in enumerate(retrieved_documents)
    ])

    prompt = f"""Based on the following documents, answer the question accurately.
If the answer is not in the documents, say so.

{context}

Question: {query}
Answer:"""

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.3,
        top_p=0.9
    )

    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example usage
documents = [
    "The Eiffel Tower was completed in 1889 and stands 330 meters tall.",
    "Located in Paris, France, it was designed by Gustave Eiffel.",
    "It was initially criticized but became a global icon."
]

answer = rag_query(
    "When was the Eiffel Tower built and who designed it?",
    documents,
    model,
    tokenizer
)
print(answer)
```

### Code Generation and Analysis

```python
prompt = """Write a Python function that:
1. Takes a list of numbers as input
2. Removes duplicates
3. Sorts in descending order
4. Returns the top 5 numbers
Include error handling and type hints."""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.4  # lower temperature for code
)

code = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(code)
```

### Multi-turn Conversation with Images

```python
from PIL import Image

conversation = []

# Turn 1: image analysis
image = Image.open("chart.png")
conversation.append({
    "role": "user",
    "content": "What does this chart show?",
    "images": [image]
})

# Process the conversation and get a response
prompt = processor.apply_chat_template(conversation, tokenize=False)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(outputs[0], skip_special_tokens=True)

conversation.append({
    "role": "assistant",
    "content": response
})

# Turn 2: follow-up question
conversation.append({
    "role": "user",
    "content": "What trends can you identify from the data?"
})

# Continue the conversation...
```

## Recommended Parameters

For optimal performance across different use cases:

### Creative Writing
- temperature: 0.8-1.0
- top_p: 0.85-0.9
- repetition_penalty: 1.05

### Code Generation
- temperature: 0.2-0.4
- top_p: 0.9
- repetition_penalty: 1.05

### Function Calling/Structured Output
- temperature: 0.1-0.3
- top_p: 0.9
- do_sample: False (greedy)

### Vision Tasks
- temperature: 0.5-0.7
- top_p: 0.9
- repetition_penalty: 1.1

### Long-form Analysis
- temperature: 0.6-0.7
- top_p: 0.9
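
As a quick illustration, each preset maps directly onto `generate()` keyword arguments. The sketch below applies the long-form analysis values from the list above; the prompt is a placeholder:

```python
# Long-form analysis preset (temperature 0.6-0.7, top_p 0.9)
inputs = tokenizer("Analyze the main arguments in the report above.", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.65,
    top_p=0.9
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
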
## Hardware Requirements

- RAM: 64GB system memory
- Flash Attention 2 enabled for efficient memory usage
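
A minimal sketch of enabling Flash Attention 2 at load time (assumes the `flash-attn` package is installed and a supported GPU; Flash Attention 2 requires fp16 or bf16 weights):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "DeepXR/Helion-V2.0-Thinking",
    torch_dtype=torch.bfloat16,  # Flash Attention 2 requires fp16/bf16
    device_map="auto",
    attn_implementation="flash_attention_2"
)
```
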
### Recommended for Vision Tasks
- GPU: 32GB+ VRAM
- RAM: 48GB system memory
- Fast storage for image loading

### Quantization Options
- 8-bit: Runs on 16GB VRAM with minimal quality loss
- 4-bit: Runs on 12GB VRAM with acceptable quality for most tasks
- Vision capabilities maintained in quantized versions
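
As a sketch, a 4-bit variant can be loaded with bitsandbytes through transformers (assumes the `bitsandbytes` package is installed; actual VRAM use also depends on context length):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    "DeepXR/Helion-V2.0-Thinking",
    quantization_config=bnb_config,
    device_map="auto"
)
```

For the 8-bit option, pass `BitsAndBytesConfig(load_in_8bit=True)` instead.
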
## Supported Use Cases

### Text-Only Tasks
- Conversational AI and chatbots
- Content generation and writing assistance
- Code generation and debugging
- Mathematical problem solving
- Text analysis and summarization
- Translation and multilingual tasks
- Question answering
- Instruction following

### Vision Tasks
- Image captioning and description
- Visual question answering
- OCR and text extraction
- Chart and graph analysis
- Diagram interpretation
- Screenshot analysis
- Document understanding
- Visual reasoning
- Object detection and counting
- Scene understanding

### Tool Use and Integration
- API integration
- Function calling
- Database query generation
- Web search integration
- Calculator and computations
- File system operations
- Multi-tool workflows
- Structured data generation

### Advanced Applications
- RAG systems
- Multi-modal chatbots
- Code assistants
- Research assistants
- Document analysis tools
- Data analysis platforms
- Educational tools
- Creative tools

## Limitations

- The model may occasionally generate plausible-sounding but incorrect information
- Performance on highly specialized technical domains may vary
- Very long contexts (150K+ tokens) may require substantial VRAM
- Image understanding works best with clear, well-lit images
- The model is primarily optimized for English, with varying performance on other languages
- Function calling requires well-structured prompts and tool definitions
- Not suitable for real-time applications requiring sub-second latency without optimization
- Vision capabilities are optimized for static images, not video
- Tool execution requires external implementation of actual tool functions

## Ethical Considerations

Helion-V2.0-Thinking has been trained with safety and alignment as core priorities:

- The model should not be used for generating harmful, illegal, or unethical content
- Outputs should be reviewed for accuracy in high-stakes applications
- The model may reflect biases present in training data despite mitigation efforts
- Vision capabilities should not be used for surveillance or privacy-invasive applications
- Users are responsible for ensuring appropriate use cases and output validation
- Function calling should be implemented with proper security measures
- Image analysis may not be 100% accurate and should be verified for critical applications

## Citation

If you use Helion-V2.0-Thinking in your research or applications, please cite:

```bibtex
@misc{helion-v2-thinking,
  title={Helion-V2.0-Thinking: A 10.2B Parameter Multimodal Language Model with Extended Context, Vision, and Tool Use},
  author={DeepXR},
  year={2025},
  publisher={Hugging Face}
}
```

## License

This model is released under the Apache 2.0 License. See LICENSE file for details.

## Acknowledgments

We thank the open-source community for their contributions to the development of language models and the tools that made this work possible. Special thanks to the Hugging Face team for their excellent libraries and infrastructure.