propella-1 artwork

propella-1 is a family of small multilingual LLMs that annotate text documents across six categories: core content, classification, quality & value, audience & purpose, safety & compliance, and geographic relevance. The annotations can be used to filter, select, and curate LLM training data at scale.

Disclaimer: This is a research project, not an official ellamind product. For production-ready evaluation solutions, check out elluminate.

Highlights

  • Annotate 18 properties: Covers well-established dimensions like content quality and educational value, plus underexplored ones like reasoning indicators and time-sensitivity.
  • Fast & accurate: Small models (0.6B, 1.7B, 4B) that punch above their weight. Trained in fp8, ready for high-throughput inference.
  • Any text, any format: Handles web pages, PDFs, code, math, post-training data and more.
  • Highly multilingual: Supports 57 languages.

overall-performance-plot

The propella-1 family of models

| Model | Parameters | Performance | Docs/s (A100 / H100) |
|---|---|---|---|
| propella-1-4b | 4B | 0.779 | 10.3 / 27.0 |
| propella-1-1.7b | 1.7B | 0.737 | 17.8 / 39.1 |
| propella-1-0.6b | 0.6B | 0.729 | 21.5 / 39.9 |

Properties

propella-1 models evaluate documents across 18 properties organized into six categories:

| Category | Property | Short Description |
|---|---|---|
| Core Content | Content Integrity | Completeness and technical quality of the content |
| | Content Ratio | Proportion of content vs. navigation/UI elements |
| | Content Length | Amount of substantive content |
| Classification | One-Sentence Description | Ultra-short neutral description of the document |
| | Content Type | Functional structure and purpose |
| | Business Sector | Industry domain relevance |
| | Technical Content | Type and intensity of specialized knowledge |
| Quality & Value | Content Quality | Overall writing and presentation quality |
| | Information Density | Ratio of valuable information to redundancy |
| | Educational Value | Potential for teaching and learning |
| | Reasoning Indicators | Presence of logical reasoning and analysis |
| Audience & Purpose | Audience Level | Target sophistication level |
| | Commercial Bias | Commercial influence on objectivity |
| | Time-Sensitivity | How content value changes over time |
| Safety & Compliance | Content Safety | Presence of inappropriate or harmful content |
| | PII Presence | Contains personally identifiable information |
| Geographic | Regional Relevance | Primary regional/cultural context |
| | Country Relevance | Specific country relevance |

Read the property reference for detailed definitions and enum values.

Datasets annotated with propella-1

See openeurollm/propella-annotations.

Input

A text document in any of the 57 supported languages.
The model has a 64k context length, but we recommend truncating documents at 50k characters (see usage).
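The recommended truncation can be sketched in a few lines (`truncate_doc` is a hypothetical helper name, not part of propella.py; 50,000 characters is the limit stated above):

```python
MAX_CHARS = 50_000  # recommended truncation limit from above

def truncate_doc(text: str, max_chars: int = MAX_CHARS) -> str:
    """Truncate a document to at most max_chars characters before annotation."""
    return text[:max_chars]
```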

Output

A JSON object containing annotations. The output strictly conforms to a predefined schema with enumerated values for categorical properties.

Example output
{
  "content_integrity": "complete",
  "content_ratio": "mostly_content",
  "content_length": "moderate",
  "one_sentence_description": "Technical documentation explaining how to define and evaluate structured LLM output schemas using elluminate's Python client.",
  "content_type": [
    "technical_documentation",
    "instructional",
    "source_code"
  ],
  "business_sector": [
    "technology_software"
  ],
  "technical_content": [
    "code_heavy"
  ],
  "information_density": "dense",
  "content_quality": "excellent",
  "audience_level": "advanced",
  "commercial_bias": "minimal",
  "time_sensitivity": "slowly_changing",
  "content_safety": "safe",
  "educational_value": "high",
  "reasoning_indicators": "explanatory",
  "pii_presence": "no_pii",
  "regional_relevance": [
    "global"
  ],
  "country_relevance": [
    "none"
  ]
}

Usage

See propella.py for prompts and schemas. We recommend enforcing a strict JSON schema without any whitespace for error-free generation.
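Whitespace-free serialization of a schema can be sketched with `json.dumps` separators (propella.py's `compact_whitespace` option presumably achieves something similar; this standalone sketch is only an illustration):

```python
import json

def compact_schema(schema: dict) -> str:
    # separators=(",", ":") drops all whitespace from the serialized schema,
    # keeping the constrained-decoding grammar minimal.
    return json.dumps(schema, separators=(",", ":"))
```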

Serving

We recommend serving propella models with SGLang and the llguidance structured output backend:

python -m sglang.launch_server \
    --model-path outputs/propella-1-4b \
    --host 0.0.0.0 \
    --port 8000 \
    --context-length 65536 \
    --max-running-requests 256 \
    --chunked-prefill-size 8192 \
    --enable-mixed-chunk \
    --num-continuous-decode-steps 8 \
    --grammar-backend llguidance \
    --mem-fraction-static 0.7

fp8 on H100:

python -m sglang.launch_server \
    --model-path outputs/propella-1-4b \
    --quantization w8a8_fp8 \
    --kv-cache-dtype fp8_e4m3 \
    --host 0.0.0.0 \
    --port 8000 \
    --context-length 65536 \
    --max-running-requests 256 \
    --chunked-prefill-size 8192 \
    --enable-mixed-chunk \
    --num-continuous-decode-steps 8 \
    --grammar-backend llguidance \
    --mem-fraction-static 0.7

For single-node multi-GPU setups, we recommend increasing data-parallel-size. For large-scale offline inference on SLURM clusters, we use inference-hive.

Sending requests via the OpenAI SDK

from openai import OpenAI
from propella import (
    create_messages,
    AnnotationResponse,
    get_annotation_response_schema,
)

document = "Hi, its me Max."

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="ellamind/propella-1-4b",
    messages=create_messages(document),
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "AnnotationResponse",
            "schema": get_annotation_response_schema(flatten=True, compact_whitespace=True),
            "strict": True,
        }
    },
)
response_content = response.choices[0].message.content
result = AnnotationResponse.model_validate_json(response_content)
print(result.model_dump_json(indent=4))
Result
{
    "content_integrity": "complete",
    "content_ratio": "complete_content",
    "content_length": "minimal",
    "one_sentence_description": "A short personal greeting introducing someone named Max.",
    "content_type": [
        "conversational"
    ],
    "business_sector": [
        "general_interest"
    ],
    "technical_content": [
        "non_technical"
    ],
    "information_density": "dense",
    "content_quality": "good",
    "audience_level": "general",
    "commercial_bias": "none",
    "time_sensitivity": "evergreen",
    "content_safety": "safe",
    "educational_value": "none",
    "reasoning_indicators": "none",
    "pii_presence": "contains_pii",
    "regional_relevance": [
        "culturally_neutral"
    ],
    "country_relevance": [
        "none"
    ]
}
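Parsed annotations like the one above can drive data curation directly. A minimal filter sketch, using field names from the example outputs; the acceptance criteria here are illustrative, not a recommendation:

```python
# Enum values taken from the example outputs above; thresholds are arbitrary.
KEEP_QUALITY = {"good", "excellent"}

def keep_for_training(ann: dict) -> bool:
    """Return True if an annotated document should be kept as training data."""
    return (
        ann.get("content_safety") == "safe"
        and ann.get("pii_presence") == "no_pii"
        and ann.get("content_quality") in KEEP_QUALITY
    )
```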

Throughput

The throughput results below provide a rough estimate of the GPU-hours required to annotate 1M documents. After a short warmup, we run inference on 5k documents, sending 1k concurrent requests to the SGLang server.
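The hours-per-1M column follows directly from the measured Docs/s:

```python
def gpu_hours_per_million(docs_per_second: float) -> float:
    """GPU-hours needed to annotate one million documents at a given throughput."""
    return 1_000_000 / docs_per_second / 3600

# e.g. propella-1-4b on A100 at 10.3 docs/s -> ~27 GPU-hours per 1M docs
```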

| Model | GPU | Docs/s | Hours per 1M docs | Prompt TPS | Output TPS | Total TPS |
|---|---|---|---|---|---|---|
| propella-1-4b | A100 80GB | 10.3 | 27.0 | 19.1k | 1.5k | 20.5k |
| propella-1-4b | H100 96GB | 22.4 | 12.4 | 41.6k | 3.2k | 44.8k |
| propella-1-4b (fp8) | H100 96GB | 27.0 | 10.3 | 50.1k | 3.9k | 54.0k |
| propella-1-1.7b | A100 80GB | 17.8 | 15.6 | 33.0k | 2.6k | 35.6k |
| propella-1-1.7b | H100 96GB | 35.8 | 7.8 | 66.5k | 5.2k | 71.8k |
| propella-1-1.7b (fp8) | H100 96GB | 39.1 | 7.1 | 72.7k | 5.7k | 78.4k |
| propella-1-0.6b | H100 96GB | 39.9 | 7.0 | 74.2k | 5.7k | 79.9k |
| propella-1-0.6b | A100 80GB | 21.5 | 12.9 | 40.0k | 3.1k | 43.1k |

Evaluation

We evaluate the propella-1 models on a test set of 3k documents. For these documents we obtain annotations from Gemini-3-Pro (reasoning_effort: high), which we treat as ground-truth labels under the assumption that they represent the upper limit of annotation quality.

All baseline models use the detailed annotator system and user prompts defined in propella.py. For throughput reasons, the propella-1 models use a very short, propella-1-specific prompt. We also tested some baseline models with the propella-1 prompt; this consistently led to worse performance, since the short prompt lacks the detailed property definitions.

Metrics by Property Type

Properties are grouped into four types; the first three are each evaluated with an appropriate metric:

  • Ordinal Properties (11 properties): QWK (Quadratic Weighted Kappa), which measures agreement while accounting for the ordinal nature of labels. It penalizes larger disagreements more heavily.
  • Binary Properties (1 property): F1, the harmonic mean of precision and recall.
  • Multi-select Properties (5 properties): IoU (Jaccard Index), intersection-over-union averaged across samples.
  • Free-text Properties (1 property): The one_sentence_description property is excluded from quantitative evaluation.
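For the multi-select properties, the per-sample Jaccard index can be sketched as follows (the convention for two empty selections is an assumption, not stated in the source):

```python
def sample_iou(pred: set, gold: set) -> float:
    """Jaccard index (IoU) between predicted and gold label sets for one sample."""
    if not pred and not gold:
        return 1.0  # assumed convention: two empty selections agree perfectly
    return len(pred & gold) / len(pred | gold)
```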

per-property-performance-plot

Overall Score

The overall score is a weighted average of the primary metric for each property type:

overall = (11/17 × avg_QWK) + (1/17 × avg_F1) + (5/17 × avg_IoU)
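In code, the weighting simply mirrors the property counts (11 ordinal, 1 binary, and 5 multi-select out of the 17 quantitatively evaluated properties):

```python
def overall_score(avg_qwk: float, avg_f1: float, avg_iou: float) -> float:
    # Weighted average of per-type metrics, weighted by number of properties.
    return (11 * avg_qwk + 1 * avg_f1 + 5 * avg_iou) / 17
```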

overall-performance-plot

Performance with fp8

propella-1 models are trained with fp8 precision and work well in both bf16 and fp8 inference modes. The plot below compares annotation quality across precisions for the 4b and 1.7b models. For the 0.6b model we recommend bf16 precision.

| Model | bf16 score | fp8 score | Diff |
|---|---|---|---|
| propella-1-4b | 0.780 | 0.783 | +0.38% |
| propella-1-1.7b | 0.737 | 0.731 | -0.81% |

Language Support

The training data for propella-1 contains documents in 57 languages:

lang_script percent
eng_Latn 35.08
spa_Latn 3.98
ita_Latn 3.97
fra_Latn 3.95
deu_Latn 3.86
pol_Latn 3.81
code 2.82
math 2.77
sft 2.41
ukr_Cyrl 0.95
nld_Latn 0.95
tha_Thai 0.95
jpn_Jpan 0.94
heb_Hebr 0.94
ell_Grek 0.93
kor_Hang 0.93
isl_Latn 0.93
dan_Latn 0.92
cat_Latn 0.92
slk_Latn 0.92
rus_Cyrl 0.91
kat_Geor 0.9
por_Latn 0.9
ben_Beng 0.9
fas_Arab 0.89
ekk_Latn 0.89
fin_Latn 0.89
tur_Latn 0.89
swe_Latn 0.88
ind_Latn 0.88
ces_Latn 0.88
lit_Latn 0.88
slv_Latn 0.87
vie_Latn 0.87
eus_Latn 0.87
bul_Cyrl 0.86
mlt_Latn 0.86
lvs_Latn 0.86
nob_Latn 0.86
hun_Latn 0.85
urd_Arab 0.85
ron_Latn 0.84
glg_Latn 0.83
gle_Latn 0.83
nno_Latn 0.83
ltg_Latn 0.77
yue_Hant 0.49
cmn_Hant 0.48
hrv_Latn 0.43
arb_Arab 0.39
bos_Latn 0.39
mkd_Cyrl 0.39
srp_Latn 0.37
cmn_Hani 0.37
hin_Deva 0.36
srp_Cyrl 0.36
als_Latn 0.35
sqi_Latn 0.03
est_Latn 0.02
nor_Latn 0.02
lav_Latn 0.02
swa_Latn 0.02

Acknowledgements

  • This project is supported by the OpenEuroLLM project, co-funded by the Digital Europe Programme under GA no. 101195233. For more information see openeurollm.eu.
  • This project is supported by the LLMs4EU project, co-funded by the Digital Europe Programme under GA no. 101198470. For more information see LLMs4EU website.
  • This project is supported by the German Federal Ministry for Economic Affairs and Energy (BMWE) under the soofi (Sovereign Open Source Foundation Models for European Intelligence) project.
  • We acknowledge the EuroHPC Joint Undertaking for supporting this project through access to the EuroHPC supercomputer LEONARDO, hosted by CINECA (Italy) and the LEONARDO consortium, through a EuroHPC AI Factory Large Scale Access call.
  • We thank the AI Service Center for Sensitive and Critical Infrastructures (KISSKI), hosted by GWDG, for additional compute access.
eu-cofunding-logo