EVE-Instruct

EVE-Instruct is a fine-tuned version of Mistral-Small-3.2-24B-Instruct-2506 specialized in Earth Intelligence (EI), with particular emphasis on the Earth Observation (EO) and Earth Science (ES) domains. The model improves the EI capabilities of Mistral-Small-3.2-24B-Instruct-2506 while maintaining its general capabilities.

For complete technical specifications, compliance documentation, and transparency disclosures in accordance with EU AI Act Article 50, refer to the Technical Documentation for EVE.

Training

EVE-Instruct is fine-tuned from Mistral Small 3.2 (24B parameters, 128k context) using a strategy that interleaves instruction fine-tuning (IFT) and long-form text, each mixing general-domain replay data with synthetic EO and Earth Sciences content. This approach enables domain adaptation while preserving general interactive capabilities such as instruction following, conversational stability, and tool use.

Fine-tuning Data

Fine-tuning data consists of two components: long-form text and instruction-formatted text, totaling approximately 33.5B tokens in the final training mixture. Due to licensing conditions of source materials, we publicly release a curated subset of 10.7B tokens (20.9M input, 60.1M output, and 10.6B context tokens) of the full dataset used for training.

Long-form text (30% of training mix) combines general-domain replay data with EO and Earth Sciences text, sourced from:

  • Raw corpus samples and high-quality filtered chunks
  • Synthetically generated text via an Active Reading pipeline, which reorganizes salient content to concentrate factual information and reinforce terminology

Instruction-formatted text (70% of training mix) combines general-domain replay data with EO and Earth Sciences instruction–response pairs, including:

  • Contextual and non-contextual Question Answering (ContextQA, SelfQA)
  • Long and multi-document QA (LongContextQA)
  • Multi-hop QA
  • Self-referential alignment prompts (role, developer, and capability specification)

Generation uses a mix of high-quality models including Mistral Large 3, Mistral Medium 3.1, GPT-4o Mini, Qwen3-235B, DeepSeek-R1, DeepSeek v3.1, and Qwen2.5-72B.

Quality control is performed by LLM-based judges assessing domain relevance, factual quality, and grounding. From approximately 21B tokens of synthetic data generated, a filtered subset forms the synthetic pool used in the final training mixture.
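The judge-based filtering step can be sketched as follows. The axis names, 0–10 score scale, and threshold below are illustrative assumptions for the sketch, not the pipeline's actual configuration:

```python
def filter_synthetic(samples, min_score=7):
    """Keep samples whose LLM-judge scores pass a threshold on every axis.

    Each sample is a dict holding the generated 'text' plus one judge
    score per axis. The axis names and 0-10 scale are assumptions made
    for illustration; the real pipeline's scoring rubric may differ.
    """
    axes = ("domain_relevance", "factual_quality", "grounding")
    return [s for s in samples if all(s[axis] >= min_score for axis in axes)]
```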

Fine-tuning Strategy

During fine-tuning, we vary (i) the ratio of long-form versus instruction-formatted text and (ii) the proportion of general-domain replay versus domain-specific data, to balance factual integration and alignment stability. We use a learning rate schedule intermediate between typical IFT and continual pretraining settings. To combine complementary behaviors, we merge checkpoints from ten training runs with different data mixtures using uniform parameter interpolation.
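Uniform parameter interpolation amounts to averaging each weight tensor elementwise across the checkpoints. A minimal sketch over PyTorch state dicts (the checkpoint-loading step is assumed; this is not the project's actual merging code):

```python
import torch


def merge_uniform(state_dicts):
    """Average each parameter tensor across N checkpoints with uniform weights.

    `state_dicts` is a list of state dicts with identical keys and shapes,
    e.g. [torch.load(p, map_location="cpu") for p in checkpoint_paths].
    """
    merged = {}
    for name, first in state_dicts[0].items():
        # Stack the N copies of this tensor and take the elementwise mean,
        # accumulating in float32 before casting back to the stored dtype.
        stacked = torch.stack([sd[name].float() for sd in state_dicts])
        merged[name] = stacked.mean(dim=0).to(first.dtype)
    return merged
```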

Alignment

As a final stage, we apply Online Direct Preference Optimization (Online DPO) to refine formatting, stylistic consistency, and preference adherence, following the same alignment recipe and preference training data as Ministral. This preserves domain knowledge acquired during fine-tuning while improving formatting consistency.

Benchmark Results - Domain-Specific

We compare EVE-Instruct to Mistral-Small-3.2-24B-Instruct-2506 and other models of a similar size.

| Model | Size (B) | MCQA Multiple (IoU) | MCQA Single (Acc.) | Hallucination (Acc.) | Hallucination (F1) | Open-Ended (Judge) | Open-Ended (EVE WR) | Open-Ended w/ Context (Judge) | Open-Ended w/ Context (EVE WR) |
|---|---|---|---|---|---|---|---|---|---|
| Llama4 Scout | 109-A17 | 80.32 | 71.23 | 91.67 | 66.08 | 87.37 | 53.95 | 71.73 | 58.31 |
| Qwen3 | 30-A3 | 78.40 | 66.36 | 93.02 | 81.30 | 94.92 | 50.70 | 81.81 | 52.12 |
| Gemma3 | 27 | 73.60 | 57.54 | 87.31 | 75.07 | 94.41 | 50.92 | 78.31 | 51.58 |
| Mistral Small 3.2 | 24 | 80.19 | 70.30 | 83.51 | 82.19 | 91.78 | 51.69 | 71.93 | 57.27 |
| EVE-Instruct | 24 | 86.12 | 77.73 | 96.35 | 84.70 | 96.40 | — | 78.28 | — |

Model performance across EO and Earth Sciences benchmark tasks (0-shot). EVE WR (win rate) indicates percentage of pairwise comparisons where EVE-Instruct is preferred over the comparison model (>50% means EVE is preferred).

Benchmark Results - General Capabilities

For general capabilities, we compare EVE-Instruct to Mistral-Small-3.2-24B-Instruct-2506 to show that the general capabilities of the model are maintained or improved.

| Category | Small 3.2 | EVE-Instruct | Δ |
|---|---|---|---|
| Math & Reasoning | 50.8 | 54.9 | +4.1 |
| Coding | 55.6 | 56.5 | +0.9 |
| Knowledge | 67.7 | 69.0 | +1.3 |
| Tool Calling | 87.9 | 90.9 | +3.0 |
| Instruction Following | 80.1 | 81.2 | +1.1 |
| Chat Quality | 90.8 | 91.7 | +0.9 |
| Overall | 72.2 | 74.0 | +1.8 |

Table: General-domain performance after domain adaptation (category-level averages over several standard benchmarks, 0–100 scale).

Usage

The model can be used with the following frameworks:

vLLM (recommended)

We recommend using this model with vLLM.

Installation

Make sure to install vLLM >= 0.9.1:

pip install vllm --upgrade

Doing so should automatically install mistral_common >= 1.6.2.

To check:

python -c "import mistral_common; print(mistral_common.__version__)"

You can also make use of a ready-to-go Docker image available on Docker Hub.

Serve

We recommend that you use EVE-Instruct in a server/client setting.

  1. Spin up a server:
vllm serve eve-esa/EVE-Instruct \
  --tokenizer_mode mistral --config_format mistral \
  --load_format mistral --tool-call-parser mistral \
  --enable-auto-tool-choice --tensor-parallel-size 2

Note: Running EVE-Instruct on GPU requires ~55 GB of GPU RAM in bf16 or fp16.

  2. To query the server, you can use a simple Python snippet. See the following examples.
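As a minimal client example, vLLM exposes an OpenAI-compatible chat completions endpoint. The sketch below assumes the server from step 1 is running at the default address (http://localhost:8000) and uses only the Python standard library; the system prompt and question are illustrative:

```python
import json
from urllib.request import Request, urlopen

URL = "http://localhost:8000/v1/chat/completions"  # default vLLM server address


def build_request(question: str, url: str = URL) -> Request:
    """Build an OpenAI-compatible chat completion request for the server."""
    payload = {
        "model": "eve-esa/EVE-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful Earth Intelligence assistant."},
            {"role": "user", "content": question},
        ],
        "max_tokens": 512,
    }
    return Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def ask(question: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urlopen(build_request(question)) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

With the server up, `print(ask("What does Sentinel-2's red-edge band measure?"))` returns the model's answer.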

Transformers

You can also use EVE-Instruct with Transformers!

To make the best use of our model with Transformers, make sure to have mistral-common >= 1.6.2 installed to use our tokenizer.

pip install mistral-common --upgrade

Then load our tokenizer along with the model and generate:

import torch

from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from transformers import Mistral3ForConditionalGeneration

model_id = "eve-esa/EVE-Instruct"

tokenizer = MistralTokenizer.from_hf_hub(model_id)

model = Mistral3ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
)

messages = [
    {"role": "system", "content": "You are a helpful Earth Intelligence assistant specializing in Earth Observation and Earth Science."},
    {"role": "user", "content": "What is the normalized difference vegetation index (NDVI) and how is it used in remote sensing?"},
]

tokenized = tokenizer.encode_chat_completion(ChatCompletionRequest(messages=messages))

input_ids = torch.tensor([tokenized.tokens])
attention_mask = torch.ones_like(input_ids)

output = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    max_new_tokens=1000,
)[0]

decoded_output = tokenizer.decode(output[len(tokenized.tokens) :])
print(decoded_output)
# NDVI (Normalized Difference Vegetation Index) is a widely used remote sensing index
# that quantifies vegetation greenness and density. It is calculated as:
#
#   NDVI = (NIR - Red) / (NIR + Red)
#
# where NIR is the near-infrared reflectance and Red is the red-band reflectance.
# Values range from -1 to +1, with higher values indicating denser, healthier vegetation.
# It is commonly used for crop monitoring, deforestation detection, and land cover mapping.
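The NDVI formula in the answer above is easy to verify numerically. A minimal check, with illustrative reflectance values:

```python
def ndvi(nir: float, red: float) -> float:
    """Normalized Difference Vegetation Index from NIR and red reflectance."""
    return (nir - red) / (nir + red)


# Dense, healthy vegetation reflects strongly in the near-infrared
# and absorbs most red light, so NDVI is close to +1.
print(round(ndvi(nir=0.45, red=0.05), 2))  # 0.8
```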

Funding

This project is supported by the European Space Agency (ESA) Φ-lab through the Large Language Model for Earth Observation and Earth Science project, as part of the Foresight Element within FutureEO Block 4 programme.

Citation

If you use this project in academic or research settings, please cite:
