propella-1 is a family of small multilingual LLMs that annotate text documents across six categories: core content, classification, quality & value, audience & purpose, safety & compliance, and geographic relevance. The annotations can be used to filter, select, and curate LLM training data at scale.
Disclaimer: This is a research project, not an official ellamind product. For production-ready evaluation solutions, check out elluminate.
Highlights
- Annotates 18 properties: Covers well-established dimensions like content quality and educational value, plus underexplored ones like reasoning indicators and time-sensitivity.
- Fast & accurate: Small models (0.6B, 1.7B, 4B) that punch above their weight. Trained in fp8, ready for high-throughput inference.
- Any text, any format: Handles web pages, PDFs, code, math, post-training data and more.
- Highly multilingual: Supports 57 languages.
The propella-1 family of models
| Model | Parameters | Overall Score | Docs/s (A100 / H100) |
|---|---|---|---|
| propella-1-4b | 4B | 0.779 | 10.3 / 27.0 |
| propella-1-1.7b | 1.7B | 0.737 | 17.8 / 39.1 |
| propella-1-0.6b | 0.6B | 0.729 | 21.5 / 39.9 |
Properties
propella-1 models evaluate documents across 18 properties organized into six categories:
| Category | Property | Short Description |
|---|---|---|
| Core Content | Content Integrity | Completeness and technical quality of the content |
| | Content Ratio | Proportion of content vs. navigation/UI elements |
| | Content Length | Amount of substantive content |
| Classification | One-Sentence Description | Ultra-short neutral description of the document |
| | Content Type | Functional structure and purpose |
| | Business Sector | Industry domain relevance |
| | Technical Content | Type and intensity of specialized knowledge |
| Quality & Value | Content Quality | Overall writing and presentation quality |
| | Information Density | Ratio of valuable information to redundancy |
| | Educational Value | Potential for teaching and learning |
| | Reasoning Indicators | Presence of logical reasoning and analysis |
| Audience & Purpose | Audience Level | Target sophistication level |
| | Commercial Bias | Commercial influence on objectivity |
| | Time-Sensitivity | How content value changes over time |
| Safety & Compliance | Content Safety | Presence of inappropriate or harmful content |
| | PII Presence | Contains personally identifiable information |
| Geographic | Regional Relevance | Primary regional/cultural context |
| | Country Relevance | Specific country relevance |
Read the property reference for detailed definitions and enum values.
Datasets annotated with propella-1
See openeurollm/propella-annotations.
Input
A text document in any of the 57 supported languages.
The model has a 64k context length, but we recommend truncating documents at 50k characters (see usage).
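For example (a trivial sketch; the helper name is illustrative, and 50,000 is the character cutoff recommended above):

```python
def prepare_document(text: str, max_chars: int = 50_000) -> str:
    # Truncate before sending to stay within the 64k-token context window.
    return text[:max_chars]
```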
Output
A JSON object containing annotations. The output strictly conforms to a predefined schema with enumerated values for categorical properties.
Example output
```json
{
  "content_integrity": "complete",
  "content_ratio": "mostly_content",
  "content_length": "moderate",
  "one_sentence_description": "Technical documentation explaining how to define and evaluate structured LLM output schemas using elluminate's Python client.",
  "content_type": [
    "technical_documentation",
    "instructional",
    "source_code"
  ],
  "business_sector": [
    "technology_software"
  ],
  "technical_content": [
    "code_heavy"
  ],
  "information_density": "dense",
  "content_quality": "excellent",
  "audience_level": "advanced",
  "commercial_bias": "minimal",
  "time_sensitivity": "slowly_changing",
  "content_safety": "safe",
  "educational_value": "high",
  "reasoning_indicators": "explanatory",
  "pii_presence": "no_pii",
  "regional_relevance": [
    "global"
  ],
  "country_relevance": [
    "none"
  ]
}
```
Usage
See propella.py for prompts and schemas. We recommend enforcing a strict JSON schema without any whitespace for error-free generation.
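As a rough illustration of the schema style (the actual AnnotationResponse in propella.py covers all 18 properties; this two-property sketch uses only enum values seen in the example outputs, and the real enums define more levels):

```python
from enum import Enum
from pydantic import BaseModel

class ContentQuality(str, Enum):
    # Illustrative subset; the real enum defines the full quality scale.
    good = "good"
    excellent = "excellent"

class PiiPresence(str, Enum):
    no_pii = "no_pii"
    contains_pii = "contains_pii"

class MiniAnnotation(BaseModel):
    content_quality: ContentQuality
    pii_presence: PiiPresence

# The JSON schema handed to the structured-output backend.
schema = MiniAnnotation.model_json_schema()
```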
Serving
We recommend serving propella models with SGLang and the llguidance structured output backend:
```bash
python -m sglang.launch_server \
  --model-path outputs/propella-1-4b \
  --host 0.0.0.0 \
  --port 8000 \
  --context-length 65536 \
  --max-running-requests 256 \
  --chunked-prefill-size 8192 \
  --enable-mixed-chunk \
  --num-continuous-decode-steps 8 \
  --grammar-backend llguidance \
  --mem-fraction-static 0.7
```
fp8 on H100
```bash
python -m sglang.launch_server \
  --model-path outputs/propella-1-4b \
  --quantization w8a8_fp8 \
  --kv-cache-dtype fp8_e4m3 \
  --host 0.0.0.0 \
  --port 8000 \
  --context-length 65536 \
  --max-running-requests 256 \
  --chunked-prefill-size 8192 \
  --enable-mixed-chunk \
  --num-continuous-decode-steps 8 \
  --grammar-backend llguidance \
  --mem-fraction-static 0.7
```
For single-node multi-GPU deployments we recommend increasing the data-parallel size.
For large-scale offline inference on SLURM clusters we use inference-hive.
Sending requests via the OpenAI SDK
```python
from openai import OpenAI
from propella import (
    create_messages,
    AnnotationResponse,
    get_annotation_response_schema,
)

document = "Hi, its me Max."

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="ellamind/propella-1-4b",
    messages=create_messages(document),
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "AnnotationResponse",
            "schema": get_annotation_response_schema(flatten=True, compact_whitespace=True),
            "strict": True,
        }
    },
)

response_content = response.choices[0].message.content
result = AnnotationResponse.model_validate_json(response_content)
print(result.model_dump_json(indent=4))
```
Result
```json
{
  "content_integrity": "complete",
  "content_ratio": "complete_content",
  "content_length": "minimal",
  "one_sentence_description": "A short personal greeting introducing someone named Max.",
  "content_type": [
    "conversational"
  ],
  "business_sector": [
    "general_interest"
  ],
  "technical_content": [
    "non_technical"
  ],
  "information_density": "dense",
  "content_quality": "good",
  "audience_level": "general",
  "commercial_bias": "none",
  "time_sensitivity": "evergreen",
  "content_safety": "safe",
  "educational_value": "none",
  "reasoning_indicators": "none",
  "pii_presence": "contains_pii",
  "regional_relevance": [
    "culturally_neutral"
  ],
  "country_relevance": [
    "none"
  ]
}
```
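For bulk annotation it usually pays to keep many requests in flight (the throughput numbers below were measured with 1k concurrent requests). A minimal sketch using the async OpenAI client and the same propella helpers; the concurrency cap and the annotate_all helper are illustrative:

```python
import asyncio

from openai import AsyncOpenAI
from propella import (
    create_messages,
    AnnotationResponse,
    get_annotation_response_schema,
)

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
semaphore = asyncio.Semaphore(256)  # cap concurrent in-flight requests

async def annotate(document: str) -> AnnotationResponse:
    async with semaphore:
        response = await client.chat.completions.create(
            model="ellamind/propella-1-4b",
            messages=create_messages(document[:50_000]),  # truncate to 50k chars
            response_format={
                "type": "json_schema",
                "json_schema": {
                    "name": "AnnotationResponse",
                    "schema": get_annotation_response_schema(
                        flatten=True, compact_whitespace=True
                    ),
                    "strict": True,
                },
            },
        )
        return AnnotationResponse.model_validate_json(
            response.choices[0].message.content
        )

async def annotate_all(documents: list[str]) -> list[AnnotationResponse]:
    return await asyncio.gather(*(annotate(doc) for doc in documents))
```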
Throughput
The throughput results below provide a rough estimate of the GPU-hours required to annotate 1M documents. After a short warmup, we run inference on 5k documents, sending 1k concurrent requests to the SGLang server.
| Model | GPU | Docs/s | Hours per 1M docs | Prompt TPS | Output TPS | Total TPS |
|---|---|---|---|---|---|---|
| propella-1-4b | A100 80GB | 10.3 | 27.0 | 19.1k | 1.5k | 20.5k |
| propella-1-4b | H100 96GB | 22.4 | 12.4 | 41.6k | 3.2k | 44.8k |
| propella-1-4b (fp8) | H100 96GB | 27.0 | 10.3 | 50.1k | 3.9k | 54.0k |
| propella-1-1.7b | A100 80GB | 17.8 | 15.6 | 33.0k | 2.6k | 35.6k |
| propella-1-1.7b | H100 96GB | 35.8 | 7.8 | 66.5k | 5.2k | 71.8k |
| propella-1-1.7b (fp8) | H100 96GB | 39.1 | 7.1 | 72.7k | 5.7k | 78.4k |
| propella-1-0.6b | A100 80GB | 21.5 | 12.9 | 40.0k | 3.1k | 43.1k |
| propella-1-0.6b | H100 96GB | 39.9 | 7.0 | 74.2k | 5.7k | 79.9k |
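The hours-per-1M-docs column follows directly from the sustained docs/s rate; a quick sanity check in Python:

```python
def gpu_hours_per_million(docs_per_second: float) -> float:
    # 1M documents divided by the rate, converted from seconds to hours.
    return 1_000_000 / docs_per_second / 3600

print(round(gpu_hours_per_million(10.3), 1))  # 27.0 -> propella-1-4b on A100
print(round(gpu_hours_per_million(39.1), 1))  # 7.1  -> propella-1-1.7b (fp8) on H100
```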
Evaluation
We evaluate the propella-1 models on a test set of 3k documents. For these documents we obtain annotations from Gemini-3-Pro (reasoning_effort: high), which we treat as ground-truth labels under the assumption that they represent the upper limit of achievable annotation quality.
All baseline models use the detailed annotator system and user prompts defined in propella.py. For throughput reasons, the propella-1 models use a very short, propella-1-specific prompt. We also tested some baseline models with the propella-1 prompt, which consistently led to worse performance because the short prompt lacks detail.
Metrics by Property Type
Properties are grouped into three scored categories, each evaluated with an appropriate metric:
- Ordinal Properties (11 properties): QWK (Quadratic Weighted Kappa), which measures agreement while accounting for the ordinal nature of the labels. It penalizes larger disagreements more heavily.
- Binary Properties (1 property): F1, the harmonic mean of precision and recall.
- Multi-select Properties (5 properties): IoU (Jaccard index), intersection-over-union averaged across samples.
- Free-text Properties (1 property): The one_sentence_description property is excluded from quantitative evaluation.
Overall Score
The overall score is a weighted average of the primary metric for each property type:
overall = (11/17 × avg_QWK) + (1/17 × avg_F1) + (5/17 × avg_IoU)
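A minimal sketch of this scoring, assuming predictions and ground-truth labels are already aligned per property; ordinal labels must first be mapped to integer ranks, and the helper names are illustrative (the single binary property would be scored with sklearn's f1_score):

```python
from sklearn.metrics import cohen_kappa_score

def qwk(y_true: list[int], y_pred: list[int]) -> float:
    # Quadratic Weighted Kappa over integer-encoded ordinal labels.
    return cohen_kappa_score(y_true, y_pred, weights="quadratic")

def mean_iou(true_sets: list[set], pred_sets: list[set]) -> float:
    # Jaccard index averaged across samples for multi-select properties.
    scores = [
        len(t & p) / len(t | p) if (t | p) else 1.0
        for t, p in zip(true_sets, pred_sets)
    ]
    return sum(scores) / len(scores)

def overall_score(avg_qwk: float, avg_f1: float, avg_iou: float) -> float:
    # Weights are the property counts per type: 11 ordinal, 1 binary, 5 multi-select.
    return (11 * avg_qwk + 1 * avg_f1 + 5 * avg_iou) / 17
```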
Performance with fp8
propella-1 models are trained with fp8 precision and work well in both bf16 and fp8 inference modes. The table below compares annotation quality across precisions for the 4b and 1.7b models. For the 0.6b model we recommend bf16 precision.
| model | bf16 score | fp8 score | diff |
|---|---|---|---|
| propella-1-4b | 0.780 | 0.783 | +0.38% |
| propella-1-1.7b | 0.737 | 0.731 | -0.81% |
Language Support
The training data for propella-1 contains documents in 57 languages, plus dedicated code, math, and SFT subsets:
| lang_script | percent |
|---|---|
| eng_Latn | 35.08 |
| spa_Latn | 3.98 |
| ita_Latn | 3.97 |
| fra_Latn | 3.95 |
| deu_Latn | 3.86 |
| pol_Latn | 3.81 |
| code | 2.82 |
| math | 2.77 |
| sft | 2.41 |
| ukr_Cyrl | 0.95 |
| nld_Latn | 0.95 |
| tha_Thai | 0.95 |
| jpn_Jpan | 0.94 |
| heb_Hebr | 0.94 |
| ell_Grek | 0.93 |
| kor_Hang | 0.93 |
| isl_Latn | 0.93 |
| dan_Latn | 0.92 |
| cat_Latn | 0.92 |
| slk_Latn | 0.92 |
| rus_Cyrl | 0.91 |
| kat_Geor | 0.90 |
| por_Latn | 0.90 |
| ben_Beng | 0.90 |
| fas_Arab | 0.89 |
| ekk_Latn | 0.89 |
| fin_Latn | 0.89 |
| tur_Latn | 0.89 |
| swe_Latn | 0.88 |
| ind_Latn | 0.88 |
| ces_Latn | 0.88 |
| lit_Latn | 0.88 |
| slv_Latn | 0.87 |
| vie_Latn | 0.87 |
| eus_Latn | 0.87 |
| bul_Cyrl | 0.86 |
| mlt_Latn | 0.86 |
| lvs_Latn | 0.86 |
| nob_Latn | 0.86 |
| hun_Latn | 0.85 |
| urd_Arab | 0.85 |
| ron_Latn | 0.84 |
| glg_Latn | 0.83 |
| gle_Latn | 0.83 |
| nno_Latn | 0.83 |
| ltg_Latn | 0.77 |
| yue_Hant | 0.49 |
| cmn_Hant | 0.48 |
| hrv_Latn | 0.43 |
| arb_Arab | 0.39 |
| bos_Latn | 0.39 |
| mkd_Cyrl | 0.39 |
| srp_Latn | 0.37 |
| cmn_Hani | 0.37 |
| hin_Deva | 0.36 |
| srp_Cyrl | 0.36 |
| als_Latn | 0.35 |
| sqi_Latn | 0.03 |
| est_Latn | 0.02 |
| nor_Latn | 0.02 |
| lav_Latn | 0.02 |
| swa_Latn | 0.02 |
Acknowledgements
- This project is supported by the OpenEuroLLM project, co-funded by the Digital Europe Programme under GA no. 101195233. For more information see openeurollm.eu.
- This project is supported by the LLMs4EU project, co-funded by the Digital Europe Programme under GA no. 101198470. For more information see LLMs4EU website.
- This project is supported by the German Federal Ministry for Economic Affairs and Energy (BMWE) under the soofi (Sovereign Open Source Foundation Models for European Intelligence) project.
- We acknowledge the EuroHPC Joint Undertaking for supporting this project through access to the EuroHPC supercomputer LEONARDO, hosted by CINECA (Italy) and the LEONARDO consortium, through an EuroHPC AI Factory Large Scale Access call.
- We thank the AI Service Center for Sensitive and Critical Infrastructures (KISSKI), hosted by GWDG, for additional compute access.
