---
library_name: vllm
language:
- en
license: apache-2.0
inference: false
base_model:
- mistralai/Mistral-Small-3.2-24B-Instruct-2506
tags:
- mistral-common
---
# EVE-Instruct
EVE-Instruct is a fine-tuned version of [Mistral-Small-3.2-24B-Instruct-2506](https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506), an expert in the field of Earth Intelligence (EI) with particular emphasis on the Earth Observation (EO) and Earth Science (ES) domains.
The model improves the EI capabilities of Mistral-Small-3.2-24B-Instruct-2506 while maintaining its general capabilities.
For complete technical specifications, compliance documentation, and transparency disclosures in accordance with EU AI Act Article 50, refer to the [Technical Documentation for EVE](Technical%20Documentation%20for%20EVE.md).
## Training
EVE-Instruct is fine-tuned from Mistral Small 3.2 (24B parameters, 128k context) using a strategy that interleaves instruction fine-tuning (IFT) and long-form text, each mixing general-domain replay data with synthetic EO and Earth Sciences content. This approach enables domain adaptation while preserving general interactive capabilities such as instruction following, conversational stability, and tool use.
### Fine-tuning Data
Fine-tuning data consists of two components: **long-form text** and **instruction-formatted text**, totaling approximately 33.5B tokens in the final training mixture. Due to licensing conditions of source materials, we publicly release a curated subset of 10.7B tokens (20.9M input, 60.1M output, and 10.6B context tokens) of the full dataset used for training.
**Long-form text (30% of training mix)** combines general-domain replay data with EO and Earth Sciences text, sourced from:
- Raw corpus samples and high-quality filtered chunks
- Synthetically generated text via an [Active Reading](https://arxiv.org/abs/2504.06832) pipeline, which reorganizes salient content to concentrate factual information and reinforce terminology
**Instruction-formatted text (70% of training mix)** combines general-domain replay data with EO and Earth Sciences instruction–response pairs, including:
- Contextual and non-contextual Question Answering (ContextQA, SelfQA)
- Long and multi-document QA (LongContextQA)
- Multi-hop QA
- Self-referential alignment prompts (role, developer, and capability specification)
Generation uses a mix of high-quality models including Mistral Large 3, Mistral Medium 3.1, GPT-4o Mini, Qwen3-235B, DeepSeek-R1, DeepSeek v3.1, and Qwen2.5-72B.
**Quality control** is performed by LLM-based judges assessing domain relevance, factual quality, and grounding. From approximately 21B tokens of synthetic data generated, a filtered subset forms the synthetic pool used in the final training mixture.
### Fine-tuning Strategy
During fine-tuning, we vary (i) the ratio of long-form versus instruction-formatted text and (ii) the proportion of general-domain replay versus domain-specific data, to balance factual integration and alignment stability. We use a learning rate schedule intermediate between typical IFT and continual pretraining settings. To combine complementary behaviors, we merge checkpoints from ten training runs with different data mixtures using uniform parameter interpolation.
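The uniform parameter interpolation described above amounts to an element-wise average of corresponding weights across checkpoints. The helper below is a minimal illustrative sketch, not the authors' actual merging code; it uses plain Python lists in place of real weight tensors (real merging would operate on `torch` state dicts):

```python
def merge_checkpoints(checkpoints):
    """Uniformly interpolate a list of checkpoints.

    Each checkpoint is a dict mapping parameter names to flat lists of
    floats; the merged model is the element-wise mean across checkpoints.
    """
    n = len(checkpoints)
    merged = {}
    for name in checkpoints[0]:
        params = [ckpt[name] for ckpt in checkpoints]
        merged[name] = [sum(vals) / n for vals in zip(*params)]
    return merged

# Two toy "checkpoints", each with a single 3-element parameter.
ckpt_a = {"layer.weight": [1.0, 2.0, 3.0]}
ckpt_b = {"layer.weight": [3.0, 4.0, 5.0]}
print(merge_checkpoints([ckpt_a, ckpt_b]))  # {'layer.weight': [2.0, 3.0, 4.0]}
```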
### Alignment
As a final stage, we apply Online Direct Preference Optimization (Online DPO) to refine formatting, stylistic consistency, and preference adherence, following the same alignment recipe and preference training data as Ministral. This preserves domain knowledge acquired during fine-tuning while improving formatting consistency.
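For reference, DPO optimizes the standard preference objective below (shown here in its generic offline form; the exact online variant and hyperparameters used for EVE are not specified in this card):

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```

where \(y_w\) and \(y_l\) are the preferred and dispreferred responses to prompt \(x\), \(\pi_{\mathrm{ref}}\) is the frozen reference policy, and \(\beta\) controls the strength of the KL-style regularization toward the reference model.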
## Benchmark Results - Domain-Specific
We compare EVE-Instruct to [Mistral-Small-3.2-24B-Instruct-2506](https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506) and other models in the same size range.
| Model | Size (B) | MCQA Multiple (IoU) | MCQA Single (Acc.) | Hallucination (Acc.) | Hallucination (F1) | Open-Ended (Judge) | Open-Ended (EVE WR) | Open-Ended w/ Context (Judge) | Open-Ended w/ Context (EVE WR) |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| Llama4 Scout | 109-A17 | 80.32 | 71.23 | 91.67 | 66.08 | 87.37 | 53.95 | 71.73 | 58.31 |
| Qwen3 | 30-A3 | 78.40 | 66.36 | 93.02 | 81.30 | 94.92 | 50.70 | 81.81 | 52.12 |
| Gemma3 | 27 | 73.60 | 57.54 | 87.31 | 75.07 | 94.41 | 50.92 | 78.31 | 51.58 |
| Mistral Small 3.2 | 24 | 80.19 | 70.30 | 83.51 | 82.19 | 91.78 | 51.69 | 71.93 | 57.27 |
| EVE-Instruct | 24 | 86.12 | 77.73 | 96.35 | 84.70 | 96.40 | — | 78.28 | — |

**Table:** Model performance across EO and Earth Sciences benchmark tasks (0-shot). EVE WR (win rate) is the percentage of pairwise comparisons in which EVE-Instruct is preferred over the comparison model (>50% means EVE is preferred).
## Benchmark Results - General Capabilities
For general capabilities, we compare EVE-Instruct to [Mistral-Small-3.1-24B-Instruct-2506](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2506) to show that the general capabilities of the model are maintained or improved.
| Category | Small 3.2 | EVE-Instruct | Δ |
|---|---:|---:|---:|
| Math & Reasoning | 50.8 | 54.9 | +4.1 |
| Coding | 55.6 | 56.5 | +0.9 |
| Knowledge | 67.7 | 69.0 | +1.3 |
| Tool Calling | 87.9 | 90.9 | +3.0 |
| Instruction Following | 80.1 | 81.2 | +1.1 |
| Chat Quality | 90.8 | 91.7 | +0.9 |
| **Overall** | **72.2** | **74.0** | **+1.8** |
**Table:** General-domain performance after domain adaptation (category-level averages over several standard benchmarks, 0–100 scale).
## Usage
The model can be used with the following frameworks:
- [`vllm (recommended)`](https://github.com/vllm-project/vllm): See [here](#vllm-recommended)
- [`transformers`](https://github.com/huggingface/transformers): See [here](#transformers)
### vLLM (recommended)
We recommend using this model with [vLLM](https://github.com/vllm-project/vllm).
#### Installation
Make sure to install [`vLLM >= 0.9.1`](https://github.com/vllm-project/vllm/releases/tag/v0.9.1):
```bash
pip install vllm --upgrade
```
Doing so should automatically install [`mistral_common >= 1.6.2`](https://github.com/mistralai/mistral-common/releases/tag/v1.6.2).
To check:
```bash
python -c "import mistral_common; print(mistral_common.__version__)"
```
You can also use a ready-to-go [docker image](https://github.com/vllm-project/vllm/blob/main/Dockerfile) or one from [Docker Hub](https://hub.docker.com/layers/vllm/vllm-openai/latest/images/sha256-de9032a92ffea7b5c007dad80b38fd44aac11eddc31c435f8e52f3b7404bbf39).
#### Serve
We recommend that you use EVE-Instruct in a server/client setting.
1. Spin up a server:
```bash
vllm serve eve-esa/EVE-Instruct \
  --tokenizer_mode mistral --config_format mistral \
  --load_format mistral --tool-call-parser mistral \
  --enable-auto-tool-choice --tensor-parallel-size 2
```
**Note:** Running EVE-Instruct on GPU requires ~55 GB of GPU RAM in bf16 or fp16.
2. To query the server, you can use a simple Python snippet.
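As a minimal illustration (assuming the server above is running at `http://localhost:8000`, vLLM's default OpenAI-compatible endpoint), the sketch below builds a chat-completion request with only the Python standard library; the send step is wrapped in a try/except so the snippet degrades gracefully when no server is reachable:

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat-completion payload for the vLLM server."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
    }

payload = build_chat_request(
    "eve-esa/EVE-Instruct",
    "What does a sudden NDVI drop over cropland usually indicate?",
)

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
try:
    with urllib.request.urlopen(req, timeout=30) as resp:
        answer = json.load(resp)["choices"][0]["message"]["content"]
        print(answer)
except OSError as exc:  # no server running, connection refused, etc.
    print(f"Request failed: {exc}")
```

Any OpenAI-compatible client (e.g. the `openai` Python package pointed at the same base URL) works equally well.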
### Transformers
You can also use EVE-Instruct with `Transformers`!
To make the best use of our model with `Transformers`, make sure to [install](https://github.com/mistralai/mistral-common) `mistral-common >= 1.6.2` to use our tokenizer.
```bash
pip install mistral-common --upgrade
```
Then load our tokenizer along with the model and generate:
```python
import torch
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from transformers import Mistral3ForConditionalGeneration
model_id = "eve-esa/EVE-Instruct"
tokenizer = MistralTokenizer.from_hf_hub(model_id)
model = Mistral3ForConditionalGeneration.from_pretrained(
model_id, torch_dtype=torch.bfloat16
)
messages = [
{"role": "system", "content": "You are a helpful Earth Intelligence assistant specializing in Earth Observation and Earth Science."},
{"role": "user", "content": "What is the normalized difference vegetation index (NDVI) and how is it used in remote sensing?"},
]
tokenized = tokenizer.encode_chat_completion(ChatCompletionRequest(messages=messages))
input_ids = torch.tensor([tokenized.tokens])
attention_mask = torch.ones_like(input_ids)
output = model.generate(
input_ids=input_ids,
attention_mask=attention_mask,
max_new_tokens=1000,
)[0]
decoded_output = tokenizer.decode(output[len(tokenized.tokens) :])
print(decoded_output)
# NDVI (Normalized Difference Vegetation Index) is a widely used remote sensing index
# that quantifies vegetation greenness and density. It is calculated as:
#
# NDVI = (NIR - Red) / (NIR + Red)
#
# where NIR is the near-infrared reflectance and Red is the red-band reflectance.
# Values range from -1 to +1, with higher values indicating denser, healthier vegetation.
# It is commonly used for crop monitoring, deforestation detection, and land cover mapping.
```
## Funding
This project is supported by the European Space Agency (ESA) Φ-lab through the Large Language Model for Earth Observation and Earth Science project, as part of the Foresight Element within the FutureEO Block 4 programme.
## Citation
If you use this project in academic or research settings, please cite: