|
|
--- |
|
|
license: apache-2.0 |
|
|
datasets: |
|
|
- HuggingFaceFW/fineweb-2 |
|
|
model-index: |
|
|
- name: DragonLLM/Dragon-3B-Base-alpha |
|
|
results: |
|
|
|
|
|
- task: |
|
|
type: multiple-choice-qa |
|
|
name: ARC Challenge |
|
|
dataset: |
|
|
type: ai2_arc |
|
|
name: AI2 ARC (Challenge) |
|
|
config: ARC-Challenge |
|
|
split: test |
|
|
metrics: |
|
|
- type: accuracy |
|
|
name: Test accuracy |
|
|
value: 50.00 |
|
|
|
|
|
- task: |
|
|
type: multiple-choice-qa |
|
|
name: ARC Easy |
|
|
dataset: |
|
|
type: ai2_arc |
|
|
name: AI2 ARC (Easy) |
|
|
config: ARC-Easy |
|
|
split: test |
|
|
metrics: |
|
|
- type: accuracy |
|
|
name: Test accuracy |
|
|
value: 76.01 |
|
|
|
|
|
- task: |
|
|
type: commonsense-reasoning |
|
|
name: HellaSwag |
|
|
dataset: |
|
|
type: hellaswag |
|
|
name: HellaSwag |
|
|
split: validation |
|
|
metrics: |
|
|
- type: accuracy |
|
|
name: Acc |
|
|
value: 71.73 |
|
|
|
|
|
- task: |
|
|
type: language-modeling |
|
|
name: LAMBADA (word prediction) |
|
|
dataset: |
|
|
type: lambada |
|
|
name: LAMBADA |
|
|
split: test |
|
|
metrics: |
|
|
- type: accuracy |
|
|
name: Acc |
|
|
value: 65.03 |
|
|
|
|
|
- task: |
|
|
type: commonsense-reasoning |
|
|
name: PIQA |
|
|
dataset: |
|
|
type: piqa |
|
|
name: PIQA |
|
|
split: validation |
|
|
metrics: |
|
|
- type: accuracy |
|
|
name: Acc |
|
|
value: 79.11 |
|
|
|
|
|
- task: |
|
|
type: information-extraction |
|
|
name: SWDE |
|
|
dataset: |
|
|
type: swde |
|
|
name: SWDE |
|
|
split: test |
|
|
metrics: |
|
|
- type: accuracy |
|
|
name: Acc |
|
|
value: 89.92 |
|
|
|
|
|
- task: |
|
|
type: classification |
|
|
name: FDA |
|
|
dataset: |
|
|
type: fda |
|
|
name: FDA |
|
|
split: test |
|
|
metrics: |
|
|
- type: accuracy |
|
|
name: Acc |
|
|
value: 81.13 |
|
|
|
|
|
--- |
|
|
## Highlights |
|
|
|
|
|
Dragon LLM introduces its new LLM architecture. Built on a hybrid GDN (Gated DeltaNet)-Transformer design that outperforms traditional architectures, it can power frugal, sovereign models that can be rapidly specialized on business data and use cases.
|
|
|
|
|
Dragon architecture features:

- Very strong ability to remember past words in the sequence compared to other hybrid approaches, inspired by Hymba (NVIDIA)
- Ability to serve more users simultaneously on equivalent hardware, with better throughput in long-context scenarios
- Extremely efficient learning
|
|
It has been validated at large scale by training a 3B model on 3.5T tokens. The model achieves performance comparable to SmolLM3-3B-Base and Qwen3-4B-Base on ARC, HellaSwag, LAMBADA, and PIQA, while being trained on 3–5× less data.
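To illustrate the hybrid idea, the sketch below interleaves linear-attention (GDN) layers with occasional full-attention layers. The layer count and interleaving ratio here are invented for illustration only and are not the model's actual configuration:

```python
from dataclasses import dataclass


@dataclass
class HybridLayerPlan:
    """Illustrative layer plan for a hybrid GDN-Transformer stack."""
    n_layers: int
    attn_every: int  # place one full-attention layer every k layers (assumed ratio)

    def layer_types(self):
        # GDN layers everywhere, with a full-attention layer at every k-th position
        return [
            "attention" if (i + 1) % self.attn_every == 0 else "gdn"
            for i in range(self.n_layers)
        ]


plan = HybridLayerPlan(n_layers=8, attn_every=4)
print(plan.layer_types())
# → ['gdn', 'gdn', 'gdn', 'attention', 'gdn', 'gdn', 'gdn', 'attention']
```

Keeping most layers linear-time (GDN) while retaining a few full-attention layers is what enables the throughput and memory gains described above.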
|
|
|
|
|
Why is this important?

- **Proves performance**: the same performance with 3–5× less data.
- **Cuts cost**: more users can be served on the same hardware.
- **Deploys anywhere**: runs in secure environments with hardware constraints (even on CPU).
- **Scales better**: higher throughput and strong long-context handling (long documents, files, code, or contracts).
|
|
|
|
|
|
|
|
How has Dragon LLM achieved this?

- By combining the best recent research on LLM architectures, accumulating gains across the whole stack, from deep-layer optimization to attention heads and KV-cache management
- An agile team able to adapt quickly and test new ideas extremely fast
- Compute support from the European Commission (EuroHPC: the JUPITER and Leonardo supercomputers)
|
|
|
|
|
|
|
|
What's next?

The next step is to deliver foundation models built on this architecture:

- 3B and 7B versions of DragonBase trained on 10T+ tokens
- Chat versions of these models
- Specialized versions for specific industry verticals, such as finance
|
|
|
|
|
If you want to know more and get updates on the project, follow us!
|
|
|
|
|
If you would like a comprehensive deep dive into the architecture: [read our blog post](https://open.substack.com/pub/dragonllm/p/inside-dragons-architecture?r=3j0al4&utm_campaign=post&utm_medium=web)
|
|
|
|
|
## Model Overview |
|
|
|
|
|
 |
|
|
|
|
|
|
|
|
## Model Benchmark |
|
|
|
|
|
|Benchmarks |Dragon |Qwen3-4B |SmolLM3| |
|
|
|----|----|----|----| |
|
|
|ARC Challenge |50% |51.28% |**52.56%**| |
|
|
|ARC Easy |76.01% |75.97% |**76.81%**| |
|
|
|HellaSwag |71.73% |54.46% |**75.2%**| |
|
|
|LAMBADA |65.03% |62.62% |**65.05%**| |
|
|
|PIQA |**79.11%** |77.86% |78.84%| |
|
|
|SWDE |89.92% |**91.99%** |88.03%| |
|
|
|FDA |81.13% |**86.75%** |76.13%| |
|
|
|Average |**73.27%** |71.56% |73.23%| |
|
|
|
|
|
All evaluations were performed with lm-eval (lm-evaluation-harness) in a zero-shot setting (num_fewshot = 0).
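For reference, a zero-shot run along these lines should approximate the table above. The flags follow the lm-evaluation-harness CLI; the exact LAMBADA task variant and batch size are assumptions, and the SWDE/FDA tasks may require extra harness configuration:

```shell
pip install lm-eval

lm_eval --model hf \
  --model_args pretrained=DragonLLM/Dragon-3B-Base-alpha,trust_remote_code=True \
  --tasks arc_challenge,arc_easy,hellaswag,lambada_openai,piqa \
  --num_fewshot 0 \
  --batch_size 8
```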
|
|
|
|
|
## Limitations |
|
|
|
|
|
This model is a foundation model, trained on large-scale general-purpose text corpora. It has not been fine-tuned for any specific downstream task. As such: |
|
|
|
|
|
- It may produce inaccurate or misleading information, particularly for factual or time-sensitive queries.

- It has no understanding of truth or intent and may generate biased, toxic, or harmful content inherited from its training data.

- It is not suitable for direct use in safety-critical or decision-making contexts (e.g., healthcare, finance, law) without additional alignment or validation.

- It does not perform well on tasks requiring domain-specific expertise, numerical precision, or structured reasoning unless further fine-tuned.

- Long or complex prompts may lead to loss of coherence or hallucinations as context length grows.
|
|
|
|
|
Fine-tuning, prompt engineering, or evaluation on downstream tasks is recommended before any production use.
|
|
|
|
|
## Quickstart |
|
|
|
|
|
Try it with: |
|
|
```python |
|
|
from transformers import AutoTokenizer, AutoModelForCausalLM |
|
|
|
|
|
model_name = "DragonLLM/Dragon-3B-Base-alpha" |
|
|
|
|
|
tokenizer = AutoTokenizer.from_pretrained(model_name) |
|
|
model = AutoModelForCausalLM.from_pretrained( |
|
|
model_name, |
|
|
dtype="auto", |
|
|
device_map="auto", |
|
|
trust_remote_code=True, |
|
|
) |
|
|
|
|
|
prompt = "Once upon a time, a valiant knight named Segurant set out on a quest to chase a dragon. He was" |
|
|
inputs = tokenizer(prompt, return_tensors="pt").to(model.device) |
|
|
|
|
|
generated_ids = model.generate( |
|
|
**inputs, |
|
|
max_new_tokens=512, |
|
|
) |
|
|
|
|
|
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True)) |
|
|
``` |
|
|
|
|
|
## Setup |
|
|
|
|
|
For better performance on GPU, we recommend using:

- [flash-linear-attention](https://github.com/fla-org/flash-linear-attention): provides the Gated DeltaNet Triton kernels.

  Install with `pip install flash-linear-attention`
|
|
|
|
|
If you use an NVIDIA GPU, you can further improve performance with:

- [flash-attention](https://github.com/Dao-AILab/flash-attention): optimized attention kernels.

  Install with `pip install flash-attn --no-build-isolation`
|
|
|
|
|
- [causal-conv1d](https://github.com/Dao-AILab/causal-conv1d): a short convolution used as part of the Gated DeltaNet layer.

  Install with `pip install causal-conv1d`
|
|
|
|
|
- (optional, recommended only for A100) [flex-head-fa](https://github.com/xiayuqing0622/flex_head_fa): computes attention with different head dimensions for QK and VO, as used by differential attention.

  Install with `pip install flex-head-fa --no-build-isolation`
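To check which of the optional kernels above are installed in your environment, a quick probe like the following can help. The import names are assumptions based on each package's distribution (flash-linear-attention imports as `fla`):

```python
import importlib.util

# Optional acceleration packages mapped to their (assumed) import names
OPTIONAL_KERNELS = {
    "flash-linear-attention": "fla",
    "flash-attn": "flash_attn",
    "causal-conv1d": "causal_conv1d",
    "flex-head-fa": "flex_head_fa",
}


def kernel_availability():
    """Return a mapping of package name -> whether its module can be imported."""
    return {
        pkg: importlib.util.find_spec(mod) is not None
        for pkg, mod in OPTIONAL_KERNELS.items()
    }


if __name__ == "__main__":
    for pkg, ok in kernel_availability().items():
        print(f"{pkg}: {'available' if ok else 'not installed'}")
```

The model still runs without these packages; they only accelerate inference on supported GPUs.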