EVE-Instruct
EVE-Instruct is a fine-tuned version of Mistral-Small-3.2-24B-Instruct-2506 specialized in Earth Intelligence (EI), with particular emphasis on the Earth Observation (EO) and Earth Science (ES) domains. The model improves the EI capabilities of Mistral-Small-3.2-24B-Instruct-2506 while maintaining its general capabilities.
For complete technical specifications, compliance documentation, and transparency disclosures in accordance with EU AI Act Article 50, refer to the Technical Documentation for EVE.
Training
EVE-Instruct is fine-tuned from Mistral Small 3.2 (24B parameters, 128k context) using a strategy that interleaves instruction fine-tuning (IFT) and long-form text, each mixing general-domain replay data with synthetic EO and Earth Sciences content. This approach enables domain adaptation while preserving general interactive capabilities such as instruction following, conversational stability, and tool use.
Fine-tuning Data
Fine-tuning data consists of two components: long-form text and instruction-formatted text, totaling approximately 33.5B tokens in the final training mixture. Due to licensing conditions of the source materials, we publicly release a curated 10.7B-token subset (20.9M input, 60.1M output, and 10.6B context tokens) of the full training dataset.
Long-form text (30% of training mix) combines general-domain replay data with EO and Earth Sciences text, sourced from:
- Raw corpus samples and high-quality filtered chunks
- Synthetically generated text via an Active Reading pipeline, which reorganizes salient content to concentrate factual information and reinforce terminology
Instruction-formatted text (70% of training mix) combines general-domain replay data with EO and Earth Sciences instruction–response pairs, including:
- Contextual and non-contextual Question Answering (ContextQA, SelfQA)
- Long and multi-document QA (LongContextQA)
- Multi-hop QA
- Self-referential alignment prompts (role, developer, and capability specification)
Generation uses a mix of high-quality models including Mistral Large 3, Mistral Medium 3.1, GPT-4o Mini, Qwen3-235B, DeepSeek-R1, DeepSeek v3.1, and Qwen2.5-72B.
Quality control is performed by LLM-based judges assessing domain relevance, factual quality, and grounding. From approximately 21B tokens of synthetic data generated, a filtered subset forms the synthetic pool used in the final training mixture.
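The judging prompts and score thresholds are not published, but the filtering step can be sketched as follows (the criteria names, 1–5 score scale, and threshold are illustrative assumptions, not the actual recipe):

```python
def filter_synthetic_pool(samples, min_score=4):
    """Keep samples whose LLM-judge scores meet the threshold on every criterion.

    Each sample carries judge scores (here 1-5) for domain relevance,
    factual quality, and grounding; only samples passing all three survive.
    """
    criteria = ("relevance", "factuality", "grounding")
    return [s for s in samples if all(s["scores"][c] >= min_score for c in criteria)]

# Toy pool: the off-topic sample is rejected on the relevance criterion.
pool = [
    {"text": "SAR backscatter primer ...", "scores": {"relevance": 5, "factuality": 4, "grounding": 5}},
    {"text": "off-topic trivia ...", "scores": {"relevance": 2, "factuality": 4, "grounding": 3}},
]
kept = filter_synthetic_pool(pool)  # only the first sample survives
```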
Fine-tuning Strategy
During fine-tuning, we vary (i) the ratio of long-form versus instruction-formatted text and (ii) the proportion of general-domain replay versus domain-specific data, to balance factual integration and alignment stability. We use a learning rate schedule intermediate between typical IFT and continual pretraining settings. To combine complementary behaviors, we merge checkpoints from ten training runs with different data mixtures using uniform parameter interpolation.
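The uniform parameter interpolation used to merge the ten checkpoints amounts to a plain average of corresponding parameters. A minimal sketch, with tensors stood in by flat Python lists for brevity (real checkpoints would be framework state dicts):

```python
def merge_checkpoints_uniform(state_dicts):
    """Uniformly interpolate model parameters across checkpoints.

    Each checkpoint maps parameter names to flat lists of floats;
    the merged model is the element-wise mean with equal weights.
    """
    n = len(state_dicts)
    return {
        name: [sum(vals) / n for vals in zip(*(sd[name] for sd in state_dicts))]
        for name in state_dicts[0]
    }

# Toy example with two one-layer "checkpoints"
ckpt_a = {"layer.weight": [1.0, 2.0]}
ckpt_b = {"layer.weight": [3.0, 4.0]}
merged = merge_checkpoints_uniform([ckpt_a, ckpt_b])
print(merged)  # {'layer.weight': [2.0, 3.0]}
```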
Alignment
As a final stage, we apply Online Direct Preference Optimization (Online DPO) to refine formatting, stylistic consistency, and preference adherence, following the same alignment recipe and preference training data as Ministral. This preserves domain knowledge acquired during fine-tuning while improving formatting consistency.
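The alignment recipe is not published beyond naming Online DPO, but the per-pair objective of standard DPO can be sketched as follows (function names and the beta value are illustrative; the loss rewards widening the policy-over-reference log-probability margin of the chosen response):

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO objective for one preference pair: -log sigmoid(beta * margin)."""
    margin = (policy_logp_chosen - ref_logp_chosen) - (
        policy_logp_rejected - ref_logp_rejected
    )
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# With no margin the loss sits at log(2); it shrinks as the chosen
# response gains probability relative to the reference model.
baseline = dpo_loss(0.0, 0.0, 0.0, 0.0)   # == log(2)
improved = dpo_loss(2.0, 0.0, 0.0, 0.0)   # < log(2)
```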
Benchmark Results - Domain-Specific
We compare EVE-Instruct to Mistral-Small-3.2-24B-Instruct-2506 and to other models in the same size range.
| Model | Size (B) | MCQA Multi (IoU) | MCQA Multi (Acc.) | MCQA Single (Acc.) | Hallucination (F1) | Open-Ended (Judge) | Open-Ended (EVE WR) | Open-Ended w/ Context (Judge) | Open-Ended w/ Context (EVE WR) |
|---|---|---|---|---|---|---|---|---|---|
| Llama4 Scout | 109-A17 | 80.32 | 71.23 | 91.67 | 66.08 | 87.37 | 53.95 | 71.73 | 58.31 |
| Qwen3 | 30-A3 | 78.40 | 66.36 | 93.02 | 81.30 | 94.92 | 50.70 | 81.81 | 52.12 |
| Gemma3 | 27 | 73.60 | 57.54 | 87.31 | 75.07 | 94.41 | 50.92 | 78.31 | 51.58 |
| Mistral Small 3.2 | 24 | 80.19 | 70.30 | 83.51 | 82.19 | 91.78 | 51.69 | 71.93 | 57.27 |
| EVE-Instruct | 24 | 86.12 | 77.73 | 96.35 | 84.70 | 96.40 | — | 78.28 | — |
Model performance across EO and Earth Sciences benchmark tasks (0-shot). EVE WR (win rate) indicates percentage of pairwise comparisons where EVE-Instruct is preferred over the comparison model (>50% means EVE is preferred).
Benchmark Results - General Capabilities
For general capabilities, we compare EVE-Instruct to Mistral-Small-3.2-24B-Instruct-2506 to show that the model's general capabilities are maintained or improved.
| Category | Small 3.2 | EVE-Instruct | Δ |
|---|---|---|---|
| Math & Reasoning | 50.8 | 54.9 | +4.1 |
| Coding | 55.6 | 56.5 | +0.9 |
| Knowledge | 67.7 | 69.0 | +1.3 |
| Tool Calling | 87.9 | 90.9 | +3.0 |
| Instruction Following | 80.1 | 81.2 | +1.1 |
| Chat Quality | 90.8 | 91.7 | +0.9 |
| Overall | 72.2 | 74.0 | +1.8 |
Table: General-domain performance after domain adaptation (category-level averages over several standard benchmarks, 0–100 scale).
Usage
The model can be used with the following frameworks:
- vLLM (recommended): see below
- Transformers: see below
vLLM (recommended)
We recommend using this model with vLLM.
Installation
Make sure to install vLLM >= 0.9.1:

```sh
pip install vllm --upgrade
```

Doing so should automatically install mistral_common >= 1.6.2.

To check:

```sh
python -c "import mistral_common; print(mistral_common.__version__)"
```
You can also make use of a ready-to-go Docker image available on Docker Hub.
Serve
We recommend that you use EVE-Instruct in a server/client setting.
- Spin up a server:

```sh
vllm serve eve-esa/EVE-Instruct \
  --tokenizer_mode mistral --config_format mistral \
  --load_format mistral --tool-call-parser mistral \
  --enable-auto-tool-choice --tensor-parallel-size 2
```
Note: Running EVE-Instruct on GPU requires ~55 GB of GPU RAM in bf16 or fp16.
- To query the server you can use a simple Python snippet. See the following examples.
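As a minimal sketch using only the standard library, a request can be sent to the OpenAI-compatible chat completions endpoint that vLLM serves (the host, port, and prompt below are illustrative; port 8000 is vLLM's default):

```python
import json
import urllib.request

def build_chat_request(model, messages, max_tokens=512):
    """Build the JSON body for the OpenAI-compatible /v1/chat/completions endpoint."""
    return {"model": model, "messages": messages, "max_tokens": max_tokens}

payload = build_chat_request(
    "eve-esa/EVE-Instruct",
    [
        {"role": "system", "content": "You are a helpful Earth Intelligence assistant."},
        {"role": "user", "content": "What does SAR backscatter reveal about surface roughness?"},
    ],
)

# Assumes the server started above is listening on localhost:8000.
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```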
Transformers
You can also use EVE-Instruct with Transformers!

To make the best use of our model with Transformers, make sure to have `mistral-common >= 1.6.2` installed to use our tokenizer:

```sh
pip install mistral-common --upgrade
```
Then load our tokenizer along with the model and generate:

```py
import torch
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from transformers import Mistral3ForConditionalGeneration

model_id = "eve-esa/EVE-Instruct"

tokenizer = MistralTokenizer.from_hf_hub(model_id)
model = Mistral3ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
)

messages = [
    {"role": "system", "content": "You are a helpful Earth Intelligence assistant specializing in Earth Observation and Earth Science."},
    {"role": "user", "content": "What is the normalized difference vegetation index (NDVI) and how is it used in remote sensing?"},
]

tokenized = tokenizer.encode_chat_completion(ChatCompletionRequest(messages=messages))

input_ids = torch.tensor([tokenized.tokens])
attention_mask = torch.ones_like(input_ids)

output = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    max_new_tokens=1000,
)[0]

# Decode only the newly generated tokens.
decoded_output = tokenizer.decode(output[len(tokenized.tokens):].tolist())
print(decoded_output)
# NDVI (Normalized Difference Vegetation Index) is a widely used remote sensing index
# that quantifies vegetation greenness and density. It is calculated as:
#
# NDVI = (NIR - Red) / (NIR + Red)
#
# where NIR is the near-infrared reflectance and Red is the red-band reflectance.
# Values range from -1 to +1, with higher values indicating denser, healthier vegetation.
# It is commonly used for crop monitoring, deforestation detection, and land cover mapping.
```
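The NDVI formula in the sample answer above is easy to verify numerically; the reflectance values below are illustrative, not taken from any real scene:

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel:
    (NIR - Red) / (NIR + Red), in [-1, +1]."""
    return (nir - red) / (nir + red)

# Healthy vegetation reflects strongly in NIR and absorbs red light,
# so it scores high; bare soil reflects both bands similarly.
vegetation = ndvi(0.45, 0.05)  # ≈ 0.8 -> dense, healthy vegetation
bare_soil = ndvi(0.20, 0.18)   # ≈ 0.05 -> sparse cover / bare soil
```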
Funding
This project is supported by the European Space Agency (ESA) Φ-lab through the Large Language Model for Earth Observation and Earth Science project, as part of the Foresight Element within the FutureEO Block 4 programme.
Citation
If you use this project in academic or research settings, please cite: