|
|
--- |
|
|
license: apache-2.0 |
|
|
datasets: |
|
|
- HuggingFaceFW/fineweb-2 |
|
|
model-index: |
|
|
- name: DragonLLM/Dragon-3B-Base-alpha |
|
|
results: |
|
|
|
|
|
- task: |
|
|
type: multiple-choice-qa |
|
|
name: ARC Challenge |
|
|
dataset: |
|
|
type: ai2_arc |
|
|
name: AI2 ARC (Challenge) |
|
|
config: ARC-Challenge |
|
|
split: test |
|
|
metrics: |
|
|
- type: accuracy |
|
|
name: Test accuracy |
|
|
value: 50.00 |
|
|
|
|
|
- task: |
|
|
type: multiple-choice-qa |
|
|
name: ARC Easy |
|
|
dataset: |
|
|
type: ai2_arc |
|
|
name: AI2 ARC (Easy) |
|
|
config: ARC-Easy |
|
|
split: test |
|
|
metrics: |
|
|
- type: accuracy |
|
|
name: Test accuracy |
|
|
value: 76.01 |
|
|
|
|
|
- task: |
|
|
type: commonsense-reasoning |
|
|
name: HellaSwag |
|
|
dataset: |
|
|
type: hellaswag |
|
|
name: HellaSwag |
|
|
split: validation |
|
|
metrics: |
|
|
- type: accuracy |
|
|
name: Acc |
|
|
value: 71.73 |
|
|
|
|
|
- task: |
|
|
type: language-modeling |
|
|
name: LAMBADA (word prediction) |
|
|
dataset: |
|
|
type: lambada |
|
|
name: LAMBADA |
|
|
split: test |
|
|
metrics: |
|
|
- type: accuracy |
|
|
name: Acc |
|
|
value: 65.03 |
|
|
|
|
|
- task: |
|
|
type: commonsense-reasoning |
|
|
name: PIQA |
|
|
dataset: |
|
|
type: piqa |
|
|
name: PIQA |
|
|
split: validation |
|
|
metrics: |
|
|
- type: accuracy |
|
|
name: Acc |
|
|
value: 79.11 |
|
|
|
|
|
- task: |
|
|
type: information-extraction |
|
|
name: SWDE |
|
|
dataset: |
|
|
type: swde |
|
|
name: SWDE |
|
|
split: test |
|
|
metrics: |
|
|
- type: accuracy |
|
|
name: Acc |
|
|
value: 89.92 |
|
|
|
|
|
- task: |
|
|
type: classification |
|
|
name: FDA |
|
|
dataset: |
|
|
type: fda |
|
|
name: FDA |
|
|
split: test |
|
|
metrics: |
|
|
- type: accuracy |
|
|
name: Acc |
|
|
value: 81.13 |
|
|
|
|
|
--- |
|
|
## Highlights |
|
|
|
|
|
Dragon LLM introduces its new LLM architecture. Built on a hybrid GDN (Gated DeltaNet)-Transformer design that outperforms traditional architectures, it can power frugal, sovereign models that can be rapidly specialized on business data and use cases.
|
|
|
|
|
Dragon architecture features:

- Very strong ability to remember past words in the sequence compared to other hybrid approaches, inspired by Hymba (NVIDIA)
- Ability to serve more users simultaneously on equivalent hardware, with better throughput in long-context scenarios
- Extremely efficient learning
|
|
It has been validated at large scale by training a 3B model on 3.5T tokens. The model achieves performance comparable to SmolLM3-3B-Base and Qwen3-4B-Base on ARC, HellaSwag, LAMBADA, and PIQA, while being trained on 3–5× less data.
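To illustrate the hybrid idea, the sketch below interleaves linear-attention (GDN) layers with occasional full-attention layers. The layer count and interleaving ratio here are invented for illustration only and are not the model's actual configuration:

```python
from dataclasses import dataclass


@dataclass
class HybridLayerPlan:
    """Illustrative layer plan for a hybrid GDN-Transformer stack."""
    n_layers: int
    attn_every: int  # place one full-attention layer every k layers (assumed ratio)

    def layer_types(self):
        # GDN layers everywhere, with a full-attention layer at every k-th position
        return [
            "attention" if (i + 1) % self.attn_every == 0 else "gdn"
            for i in range(self.n_layers)
        ]


plan = HybridLayerPlan(n_layers=8, attn_every=4)
print(plan.layer_types())
# → ['gdn', 'gdn', 'gdn', 'attention', 'gdn', 'gdn', 'gdn', 'attention']
```

Keeping most layers linear-time (GDN) while retaining a few full-attention layers is what enables the throughput and memory gains described above.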
|
|
|
|
|
Why is this important?

- **Proves performance**: the same performance with 3–5× less data.
- **Cuts cost**: more users can be served on the same hardware.
- **Deploys anywhere**: runs in secure environments with hardware constraints (even on CPU).
- **Scales better**: higher throughput and strong long-context handling (long documents, files, code, or contracts).
|
|
|
|
|
|
|
|
How has Dragon LLM achieved this?

- By combining the best recent research on LLM architectures, accumulating gains across the whole stack, from deep-layer optimization to attention heads and KV-cache management
- An agile team able to adapt quickly and test new ideas extremely fast
- Compute support from the European Commission (EuroHPC: the JUPITER and Leonardo supercomputers)
|
|
|
|
|
|
|
|
What's next?

The next step is to deliver foundation models built on this architecture:

- 3B and 7B versions of DragonBase trained on 10T+ tokens
- Chat versions of these models
- Specialized versions for specific industry verticals, such as finance
|
|
|
|
|
If you want to know more and get updates on the project, follow us!
|
|
|
|
|
If you would like a comprehensive deep dive into the architecture: [read our blog post](https://open.substack.com/pub/dragonllm/p/inside-dragons-architecture?r=3j0al4&utm_campaign=post&utm_medium=web)
|
|
|
|
|
## Model Overview |
|
|
|
|
|
 |
|
|
|
|
|
|
|
|
## Model Benchmark |
|
|
|
|
|
|Benchmarks |Dragon |Qwen3-4B |SmolLM3| |
|
|
|----|----|----|----| |
|
|
|ARC Challenge |50% |51.28% |**52.56%**| |
|
|
|ARC Easy |76.01% |75.97% |**76.81%**| |
|
|
|HellaSwag |71.73% |54.46% |**75.2%**| |
|
|
|LAMBADA |65.03% |62.62% |**65.05%**| |
|
|
|PIQA |**79.11%** |77.86% |78.84%| |
|
|
|SWDE |89.92% |**91.99%** |88.03%| |
|
|
|FDA |81.13% |**86.75%** |76.13%| |
|
|
|Average |**73.27%** |71.56% |73.23%| |
|
|
|
|
|
All evaluations were performed with lm-eval (lm-evaluation-harness) in a zero-shot setting (num_fewshot = 0).
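For reference, a zero-shot run along these lines should approximate the table above. The flags follow the lm-evaluation-harness CLI; the exact LAMBADA task variant and batch size are assumptions, and the SWDE/FDA tasks may require extra harness configuration:

```shell
pip install lm-eval

lm_eval --model hf \
  --model_args pretrained=DragonLLM/Dragon-3B-Base-alpha,trust_remote_code=True \
  --tasks arc_challenge,arc_easy,hellaswag,lambada_openai,piqa \
  --num_fewshot 0 \
  --batch_size 8
```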
|
|
|
|
|
## Limitations |
|
|
|
|
|
This model is a foundation model, trained on large-scale general-purpose text corpora. It has not been fine-tuned for any specific downstream task. As such: |
|
|
|
|
|
- It may produce inaccurate or misleading information, particularly for factual or time-sensitive queries.

- It has no understanding of truth or intent and may generate biased, toxic, or harmful content inherited from its training data.

- It is not suitable for direct use in safety-critical or decision-making contexts (e.g., healthcare, finance, law) without additional alignment or validation.

- It does not perform well on tasks requiring domain-specific expertise, numerical precision, or structured reasoning unless further fine-tuned.

- Long or complex prompts may lead to loss of coherence or hallucinations as context length grows.
|
|
|
|
|
Fine-tuning, prompt engineering, or evaluation on downstream tasks is recommended before any production use.
|
|
|
|
|
## Quickstart |
|
|
|
|
|
Try it with: |
|
|
```python |
|
|
from transformers import AutoTokenizer, AutoModelForCausalLM |
|
|
|
|
|
model_name = "DragonLLM/Dragon-3B-Base-alpha" |
|
|
|
|
|
tokenizer = AutoTokenizer.from_pretrained(model_name) |
|
|
model = AutoModelForCausalLM.from_pretrained( |
|
|
model_name, |
|
|
dtype="auto", |
|
|
device_map="auto", |
|
|
trust_remote_code=True, |
|
|
) |
|
|
|
|
|
prompt = "Once upon a time, a valiant knight named Segurant set out on a quest to chase a dragon. He was" |
|
|
inputs = tokenizer(prompt, return_tensors="pt").to(model.device) |
|
|
|
|
|
generated_ids = model.generate( |
|
|
**inputs, |
|
|
max_new_tokens=512, |
|
|
) |
|
|
|
|
|
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True)) |
|
|
``` |
|
|
|
|
|
## Setup |
|
|
|
|
|
For better performance on GPU, we recommend using:

- [flash-linear-attention](https://github.com/fla-org/flash-linear-attention): provides the Gated DeltaNet Triton kernels.

  Install with `pip install flash-linear-attention`
|
|
|
|
|
If you use an NVIDIA GPU, you can further improve performance with:

- [flash-attention](https://github.com/Dao-AILab/flash-attention): optimized attention kernels.

  Install with `pip install flash-attn --no-build-isolation`
|
|
|
|
|
- [causal-conv1d](https://github.com/Dao-AILab/causal-conv1d): a short convolution used as part of the Gated DeltaNet layer.

  Install with `pip install causal-conv1d`
|
|
|
|
|
- (optional, recommended only for A100) [flex-head-fa](https://github.com/xiayuqing0622/flex_head_fa): computes attention with different head dimensions for QK and VO, as used by differential attention.

  Install with `pip install flex-head-fa --no-build-isolation`
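To check which of the optional kernels above are installed in your environment, a quick probe like the following can help. The import names are assumptions based on each package's distribution (flash-linear-attention imports as `fla`):

```python
import importlib.util

# Optional acceleration packages mapped to their (assumed) import names
OPTIONAL_KERNELS = {
    "flash-linear-attention": "fla",
    "flash-attn": "flash_attn",
    "causal-conv1d": "causal_conv1d",
    "flex-head-fa": "flex_head_fa",
}


def kernel_availability():
    """Return a mapping of package name -> whether its module can be imported."""
    return {
        pkg: importlib.util.find_spec(mod) is not None
        for pkg, mod in OPTIONAL_KERNELS.items()
    }


if __name__ == "__main__":
    for pkg, ok in kernel_availability().items():
        print(f"{pkg}: {'available' if ok else 'not installed'}")
```

The model still runs without these packages; they only accelerate inference on supported GPUs.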