propella-1 is a family of small multilingual LLMs that annotate text documents across six categories: core content, classification, quality & value, audience & purpose, safety & compliance, and geographic relevance. The annotations can be used to filter, select, and curate LLM training data at scale.
Disclaimer: This is a research project, not an official ellamind product. For production-ready evaluation solutions, check out elluminate.
Highlights
- Annotates 18 properties: Covers well-established dimensions like content quality and educational value, plus underexplored ones like reasoning indicators and time-sensitivity.
- Fast & accurate: Small models (0.6B, 1.7B, 4B) that punch above their weight. Trained in fp8, ready for high-throughput inference.
- Any text, any format: Handles web pages, PDFs, code, math, post-training data and more.
- Highly multilingual: Supports 57 languages.
The propella-1 family of models
| Model | Parameters | Overall Score | Docs/s (A100 / H100) |
|---|---|---|---|
| propella-1-4b | 4B | 0.779 | 10.3 / 27.0 |
| propella-1-1.7b | 1.7B | 0.737 | 17.8 / 39.1 |
| propella-1-0.6b | 0.6B | 0.729 | 21.5 / 39.9 |
Properties
propella-1 models evaluate documents across 18 properties organized into six categories:
| Category | Property | Short Description |
|---|---|---|
| Core Content | Content Integrity | Completeness and technical quality of the content |
| | Content Ratio | Proportion of content vs. navigation/UI elements |
| | Content Length | Amount of substantive content |
| Classification | One-Sentence Description | Ultra-short neutral description of the document |
| | Content Type | Functional structure and purpose |
| | Business Sector | Industry domain relevance |
| | Technical Content | Type and intensity of specialized knowledge |
| Quality & Value | Content Quality | Overall writing and presentation quality |
| | Information Density | Ratio of valuable information to redundancy |
| | Educational Value | Potential for teaching and learning |
| | Reasoning Indicators | Presence of logical reasoning and analysis |
| Audience & Purpose | Audience Level | Target sophistication level |
| | Commercial Bias | Commercial influence on objectivity |
| | Time-Sensitivity | How content value changes over time |
| Safety & Compliance | Content Safety | Presence of inappropriate or harmful content |
| | PII Presence | Contains personally identifiable information |
| Geographic | Regional Relevance | Primary regional/cultural context |
| | Country Relevance | Specific country relevance |
Read the property reference for detailed definitions and enum values.
Datasets annotated with propella-1
See openeurollm/propella-annotations.
Input
A text document in any of the 57 supported languages.
The model has a 64k context length, but we recommend truncating documents at 50k characters (see usage).
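For example (a trivial sketch; the helper name is illustrative, and 50,000 is the character cutoff recommended above):

```python
def prepare_document(text: str, max_chars: int = 50_000) -> str:
    # Truncate before sending to stay within the 64k-token context window.
    return text[:max_chars]
```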
Output
A JSON object containing annotations. The output strictly conforms to a predefined schema with enumerated values for categorical properties.
Example output
```json
{
  "content_integrity": "complete",
  "content_ratio": "mostly_content",
  "content_length": "moderate",
  "one_sentence_description": "Technical documentation explaining how to define and evaluate structured LLM output schemas using elluminate's Python client.",
  "content_type": [
    "technical_documentation",
    "instructional",
    "source_code"
  ],
  "business_sector": [
    "technology_software"
  ],
  "technical_content": [
    "code_heavy"
  ],
  "information_density": "dense",
  "content_quality": "excellent",
  "audience_level": "advanced",
  "commercial_bias": "minimal",
  "time_sensitivity": "slowly_changing",
  "content_safety": "safe",
  "educational_value": "high",
  "reasoning_indicators": "explanatory",
  "pii_presence": "no_pii",
  "regional_relevance": [
    "global"
  ],
  "country_relevance": [
    "none"
  ]
}
```
Usage
See propella.py for prompts and schemas. We recommend enforcing a strict JSON schema without any whitespace for error-free generation.
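As a rough illustration of the schema style (the actual AnnotationResponse in propella.py covers all 18 properties; this two-property sketch uses only enum values seen in the example outputs, and the real enums define more levels):

```python
from enum import Enum
from pydantic import BaseModel

class ContentQuality(str, Enum):
    # Illustrative subset; the real enum defines the full quality scale.
    good = "good"
    excellent = "excellent"

class PiiPresence(str, Enum):
    no_pii = "no_pii"
    contains_pii = "contains_pii"

class MiniAnnotation(BaseModel):
    content_quality: ContentQuality
    pii_presence: PiiPresence

# The JSON schema handed to the structured-output backend.
schema = MiniAnnotation.model_json_schema()
```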
Serving
We recommend serving propella models with SGLang and the llguidance structured output backend:
```bash
python -m sglang.launch_server \
  --model-path outputs/propella-1-4b \
  --host 0.0.0.0 \
  --port 8000 \
  --context-length 65536 \
  --max-running-requests 256 \
  --chunked-prefill-size 8192 \
  --enable-mixed-chunk \
  --num-continuous-decode-steps 8 \
  --grammar-backend llguidance \
  --mem-fraction-static 0.7
```
fp8 on H100
```bash
python -m sglang.launch_server \
  --model-path outputs/propella-1-4b \
  --quantization w8a8_fp8 \
  --kv-cache-dtype fp8_e4m3 \
  --host 0.0.0.0 \
  --port 8000 \
  --context-length 65536 \
  --max-running-requests 256 \
  --chunked-prefill-size 8192 \
  --enable-mixed-chunk \
  --num-continuous-decode-steps 8 \
  --grammar-backend llguidance \
  --mem-fraction-static 0.7
```
For single-node multi-GPU deployments we recommend increasing the data-parallel size.
For large-scale offline inference on SLURM clusters we use inference-hive.
Sending requests via the OpenAI SDK
```python
from openai import OpenAI
from propella import (
    create_messages,
    AnnotationResponse,
    get_annotation_response_schema,
)

document = "Hi, its me Max."

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="ellamind/propella-1-4b",
    messages=create_messages(document),
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "AnnotationResponse",
            "schema": get_annotation_response_schema(flatten=True, compact_whitespace=True),
            "strict": True,
        }
    },
)

response_content = response.choices[0].message.content
result = AnnotationResponse.model_validate_json(response_content)
print(result.model_dump_json(indent=4))
```
Result
```json
{
  "content_integrity": "complete",
  "content_ratio": "complete_content",
  "content_length": "minimal",
  "one_sentence_description": "A short personal greeting introducing someone named Max.",
  "content_type": [
    "conversational"
  ],
  "business_sector": [
    "general_interest"
  ],
  "technical_content": [
    "non_technical"
  ],
  "information_density": "dense",
  "content_quality": "good",
  "audience_level": "general",
  "commercial_bias": "none",
  "time_sensitivity": "evergreen",
  "content_safety": "safe",
  "educational_value": "none",
  "reasoning_indicators": "none",
  "pii_presence": "contains_pii",
  "regional_relevance": [
    "culturally_neutral"
  ],
  "country_relevance": [
    "none"
  ]
}
```
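For bulk annotation it usually pays to keep many requests in flight (the throughput numbers below were measured with 1k concurrent requests). A minimal sketch using the async OpenAI client and the same propella helpers; the concurrency cap and the annotate_all helper are illustrative:

```python
import asyncio

from openai import AsyncOpenAI
from propella import (
    create_messages,
    AnnotationResponse,
    get_annotation_response_schema,
)

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
semaphore = asyncio.Semaphore(256)  # cap concurrent in-flight requests

async def annotate(document: str) -> AnnotationResponse:
    async with semaphore:
        response = await client.chat.completions.create(
            model="ellamind/propella-1-4b",
            messages=create_messages(document[:50_000]),  # truncate to 50k chars
            response_format={
                "type": "json_schema",
                "json_schema": {
                    "name": "AnnotationResponse",
                    "schema": get_annotation_response_schema(
                        flatten=True, compact_whitespace=True
                    ),
                    "strict": True,
                },
            },
        )
        return AnnotationResponse.model_validate_json(
            response.choices[0].message.content
        )

async def annotate_all(documents: list[str]) -> list[AnnotationResponse]:
    return await asyncio.gather(*(annotate(doc) for doc in documents))
```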
Throughput
The throughput results below provide a rough estimate of the GPU-hours required to annotate 1M documents. After a short warmup, we run inference on 5k documents, sending 1k concurrent requests to the SGLang server.
| Model | GPU | Docs/s | Hours per 1M docs | Prompt TPS | Output TPS | Total TPS |
|---|---|---|---|---|---|---|
| propella-1-4b | A100 80GB | 10.3 | 27.0 | 19.1k | 1.5k | 20.5k |
| propella-1-4b | H100 96GB | 22.4 | 12.4 | 41.6k | 3.2k | 44.8k |
| propella-1-4b (fp8) | H100 96GB | 27.0 | 10.3 | 50.1k | 3.9k | 54.0k |
| propella-1-1.7b | A100 80GB | 17.8 | 15.6 | 33.0k | 2.6k | 35.6k |
| propella-1-1.7b | H100 96GB | 35.8 | 7.8 | 66.5k | 5.2k | 71.8k |
| propella-1-1.7b (fp8) | H100 96GB | 39.1 | 7.1 | 72.7k | 5.7k | 78.4k |
| propella-1-0.6b | A100 80GB | 21.5 | 12.9 | 40.0k | 3.1k | 43.1k |
| propella-1-0.6b | H100 96GB | 39.9 | 7.0 | 74.2k | 5.7k | 79.9k |
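The hours-per-1M-docs column follows directly from the sustained docs/s rate; a quick sanity check in Python:

```python
def gpu_hours_per_million(docs_per_second: float) -> float:
    # 1M documents divided by the rate, converted from seconds to hours.
    return 1_000_000 / docs_per_second / 3600

print(round(gpu_hours_per_million(10.3), 1))  # 27.0 -> propella-1-4b on A100
print(round(gpu_hours_per_million(39.1), 1))  # 7.1  -> propella-1-1.7b (fp8) on H100
```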
Evaluation
We evaluate the propella-1 models on a test set of 3k documents. For these documents we obtain annotations from Gemini-3-Pro (reasoning_effort: high), which we treat as ground-truth labels under the assumption that they represent the upper limit of achievable annotation quality.
All baseline models use the detailed annotator system and user prompts defined in propella.py. For throughput reasons, the propella-1 models use a very short, propella-1-specific prompt. We also tested some baseline models with the propella-1 prompt, which consistently led to worse performance because the short prompt lacks detail.
Metrics by Property Type
Properties are grouped into three scored categories, each evaluated with an appropriate metric:
- Ordinal Properties (11 properties): QWK (Quadratic Weighted Kappa), which measures agreement while accounting for the ordinal nature of the labels. It penalizes larger disagreements more heavily.
- Binary Properties (1 property): F1, the harmonic mean of precision and recall.
- Multi-select Properties (5 properties): IoU (Jaccard index), intersection-over-union averaged across samples.
- Free-text Properties (1 property): The one_sentence_description property is excluded from quantitative evaluation.
Overall Score
The overall score is a weighted average of the primary metric for each property type:
overall = (11/17 × avg_QWK) + (1/17 × avg_F1) + (5/17 × avg_IoU)
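A minimal sketch of this scoring, assuming predictions and ground-truth labels are already aligned per property; ordinal labels must first be mapped to integer ranks, and the helper names are illustrative (the single binary property would be scored with sklearn's f1_score):

```python
from sklearn.metrics import cohen_kappa_score

def qwk(y_true: list[int], y_pred: list[int]) -> float:
    # Quadratic Weighted Kappa over integer-encoded ordinal labels.
    return cohen_kappa_score(y_true, y_pred, weights="quadratic")

def mean_iou(true_sets: list[set], pred_sets: list[set]) -> float:
    # Jaccard index averaged across samples for multi-select properties.
    scores = [
        len(t & p) / len(t | p) if (t | p) else 1.0
        for t, p in zip(true_sets, pred_sets)
    ]
    return sum(scores) / len(scores)

def overall_score(avg_qwk: float, avg_f1: float, avg_iou: float) -> float:
    # Weights are the property counts per type: 11 ordinal, 1 binary, 5 multi-select.
    return (11 * avg_qwk + 1 * avg_f1 + 5 * avg_iou) / 17
```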
Performance with fp8
propella-1 models are trained with fp8 precision and work well in both bf16 and fp8 inference modes. The table below compares annotation quality across precisions for the 4b and 1.7b models. For the 0.6b model we recommend bf16 precision.
| model | bf16 score | fp8 score | diff |
|---|---|---|---|
| propella-1-4b | 0.780 | 0.783 | +0.38% |
| propella-1-1.7b | 0.737 | 0.731 | -0.81% |
Language Support
The training data for propella-1 contains documents in 57 languages, plus dedicated code, math, and SFT subsets:
| lang_script | percent |
|---|---|
| eng_Latn | 35.08 |
| spa_Latn | 3.98 |
| ita_Latn | 3.97 |
| fra_Latn | 3.95 |
| deu_Latn | 3.86 |
| pol_Latn | 3.81 |
| code | 2.82 |
| math | 2.77 |
| sft | 2.41 |
| ukr_Cyrl | 0.95 |
| nld_Latn | 0.95 |
| tha_Thai | 0.95 |
| jpn_Jpan | 0.94 |
| heb_Hebr | 0.94 |
| ell_Grek | 0.93 |
| kor_Hang | 0.93 |
| isl_Latn | 0.93 |
| dan_Latn | 0.92 |
| cat_Latn | 0.92 |
| slk_Latn | 0.92 |
| rus_Cyrl | 0.91 |
| kat_Geor | 0.90 |
| por_Latn | 0.90 |
| ben_Beng | 0.90 |
| fas_Arab | 0.89 |
| ekk_Latn | 0.89 |
| fin_Latn | 0.89 |
| tur_Latn | 0.89 |
| swe_Latn | 0.88 |
| ind_Latn | 0.88 |
| ces_Latn | 0.88 |
| lit_Latn | 0.88 |
| slv_Latn | 0.87 |
| vie_Latn | 0.87 |
| eus_Latn | 0.87 |
| bul_Cyrl | 0.86 |
| mlt_Latn | 0.86 |
| lvs_Latn | 0.86 |
| nob_Latn | 0.86 |
| hun_Latn | 0.85 |
| urd_Arab | 0.85 |
| ron_Latn | 0.84 |
| glg_Latn | 0.83 |
| gle_Latn | 0.83 |
| nno_Latn | 0.83 |
| ltg_Latn | 0.77 |
| yue_Hant | 0.49 |
| cmn_Hant | 0.48 |
| hrv_Latn | 0.43 |
| arb_Arab | 0.39 |
| bos_Latn | 0.39 |
| mkd_Cyrl | 0.39 |
| srp_Latn | 0.37 |
| cmn_Hani | 0.37 |
| hin_Deva | 0.36 |
| srp_Cyrl | 0.36 |
| als_Latn | 0.35 |
| sqi_Latn | 0.03 |
| est_Latn | 0.02 |
| nor_Latn | 0.02 |
| lav_Latn | 0.02 |
| swa_Latn | 0.02 |
Acknowledgements
- This project is supported by the OpenEuroLLM project, co-funded by the Digital Europe Programme under GA no. 101195233. For more information see openeurollm.eu.
- This project is supported by the LLMs4EU project, co-funded by the Digital Europe Programme under GA no. 101198470. For more information see LLMs4EU website.
- This project is supported by the German Federal Ministry for Economic Affairs and Energy (BMWE) under the soofi (Sovereign Open Source Foundation Models for European Intelligence) project.
- We acknowledge the EuroHPC Joint Undertaking for supporting this project through access to the EuroHPC supercomputer LEONARDO, hosted by CINECA (Italy) and the LEONARDO consortium, through an EuroHPC AI Factory Large Scale Access call.
- We thank the AI Service Center for Sensitive and Critical Infrastructures (KISSKI), hosted by GWDG, for additional compute access.
