bge-small-en-text-quality

This model scores the quality of English-language text, mapping each input to a single continuous quality score.

It is a fine-tuned version of BAAI/bge-small-en-v1.5 optimized on the agentlans/text-quality-v3 dataset for text quality regression.

Evaluation Performance

  • Validation Loss: 0.1176 (best checkpoint, Epoch 2)
  • Mean Squared Error (MSE): 0.1176 (same checkpoint, Epoch 2)

Model Description

This model scores the informational value and professional quality of English text. It outputs a single scalar value per input. Higher scores generally indicate structured, formal, high-information content (e.g., academic, technical, or well-organized reference text), while lower scores indicate spam, casual chatter, or low-context snippets.

How to Use

You can load this model with the Hugging Face transformers library as a sequence-classification model that returns one regression value per input:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load model and tokenizer
model_name = "agentlans/bge-small-en-text-quality"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Sample texts
texts = [
    "Original publication and designation: Decaisne, J. (1842). Essais sur une classification des algues...",
    "Modern Action Figure Toys & Collectibles!! You've been outbid to E****y! to YOU!"
]

# Tokenize and predict
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
    # The model outputs a single continuous value per text
    scores = outputs.logits.squeeze(-1).tolist()

for text, score in zip(texts, scores):
    print(f"Score: {score:.4f} | Text: {text[:80]}...")
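
For larger corpora it is usually more practical to score texts in fixed-size batches than in one call. The following is a minimal sketch, not part of the original card: score_in_batches and its batch_size default are hypothetical names chosen for illustration, and score_batch stands for any callable that wraps the tokenizer and model call shown above.

```python
from typing import Callable, List, Sequence


def score_in_batches(
    texts: Sequence[str],
    score_batch: Callable[[Sequence[str]], List[float]],
    batch_size: int = 32,  # hypothetical default, tune for your hardware
) -> List[float]:
    """Score texts chunk by chunk and return one score per input, in order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    scores: List[float] = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        scores.extend(score_batch(chunk))
    return scores


# With the model loaded as above, score_batch could be written like this
# (left as a comment so the sketch runs without downloading weights):
#
# def score_batch(chunk):
#     inputs = tokenizer(list(chunk), padding=True, truncation=True, return_tensors="pt")
#     with torch.no_grad():
#         return model(**inputs).logits.squeeze(-1).tolist()
```

Keeping the model call behind a callable also makes the batching logic easy to test with a stand-in scorer.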

Intended Uses & Limitations

Primary Use Cases

  • Content Moderation: A rapid, lightweight method to filter out spam, SEO filler, and corrupted web-scraped text.
  • Data Filtering: Sorting or ranking documents by informational value before using them for LLM pre-training or RAG (Retrieval-Augmented Generation) pipelines.
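
Both use cases above come down to thresholding and sorting the scores. A minimal sketch follows; filter_and_rank is a hypothetical helper and the cutoff of 0.0 is an illustrative assumption, not a threshold recommended by the model card. The two example scores are the predicted values reported for these texts in the sample evaluation results.

```python
from typing import List, Sequence, Tuple


def filter_and_rank(
    texts: Sequence[str],
    scores: Sequence[float],
    min_score: float = 0.0,  # illustrative cutoff, choose one for your data
) -> List[Tuple[float, str]]:
    """Drop texts scoring below min_score and return the rest, best first."""
    if len(texts) != len(scores):
        raise ValueError("texts and scores must have the same length")
    kept = [(s, t) for t, s in zip(texts, scores) if s >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept


texts = [
    "Modern Action Figure Toys & Collectibles - Vintage, Rare and Hard to Find Toys including Convention ...",
    "Original publication and holotype designation: Decaisne, J. (1842). Essais sur une classification de...",
]
scores = [-1.6846, 1.6182]  # predicted values from the sample evaluation rows

ranked = filter_and_rank(texts, scores, min_score=0.0)
for score, text in ranked:
    print(f"{score:.4f} | {text[:60]}")
```

A sensible cutoff depends on the corpus, so it is worth inspecting the score distribution on a sample before fixing one.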

Known Limitations

  • Genre Bias: The model is heavily weighted toward structured, informative prose. Subjective, opinionated, or creative writing (e.g., personal blogs, fiction, entertainment reviews) will inherently score lower regardless of their actual artistic or human value.

  • Lack of Explainability: It returns a raw numerical score without indicating why a text was deemed high or low quality.
  • No Fact-Checking Capability: The model scores structural and stylistic indicators of quality. It cannot verify the objective accuracy or truthfulness of a statement; highly structured misinformation may still receive a high rating.

Training and Evaluation Data

The model was trained and evaluated on the agentlans/text-quality-v3 dataset. Below is a sample of text inputs alongside their predicted scores versus their true dataset targets.

Sample Evaluation Results

Input (truncated) | Predicted | Actual
Modern Action Figure Toys & Collectibles - Vintage, Rare and Hard to Find Toys including Convention ... | -1.6846 | -1.8808
Find 2 listings related to afterglow hair salon in Defuniak Springs on YP.com. See reviews, photos, ... | -1.6602 | -1.7688
We had a little bit of down time yesterday due to some technical difficulties. In other words Martin... | -1.373 | -1.403
Remember Ellen Feiss? From Apple’s ‘Switch’ campaign? THAT is awesome! Thank you so much for sharing... | -1.25 | -1.0043
Should we talk with U.S. Congressmen who we know support campaign finance reform? Our Theory of Chan... | -0.2137 | -0.6712
If you are looking for one of the best high end luxury apartment rentals in all of London I highly r... | -0.5405 | -0.6453
To be fair, Lebron is trying to get his teammates involved. He kinda has to play her ball here, even... | -0.0917 | -0.3884
Sarah Hofstetter is the global CEO at 360i, the hotshot agency that’s behind some of the most buzzed... | -0.3315 | -0.3483
Our Lynchburg office recently completed a retrofit roof at Carter Machinery. They installed 1 ½” ISO... | -0.4929 | -0.2979
A statement suspension light, the Innermost Beads Octo Pendant string together multiple spheres of g... | 0.1059 | -0.1651
Today I wanted to share a clip art set that isn't new but I thought you all might be working on some... | -0.292 | -0.1011
Chinnamasta shows us the simple, playful, and fierce truth that much of what we need is already righ... | -0.2188 | -0.0925
Ministers order BDS activist Lydia de Leeuw to be denied entry into Israel. Public Security Minister... | -0.1429 | 0.0449
LJW lawyers have extensive experience litigating commercial disputes on a broad range of issues and ... | 0.3911 | 0.3204
Dews is a professional waterman and has made a name for himself in stand up paddle boarding, surfing... | 0.2075 | 0.3725
The festival of Lohri is Punjabis’ cultural celebration which marks the culmination of winter by lig... | 0.6953 | 0.5373
At PEC we understand government process for Afghanistan Embassy MBBS Certificate Attestation which c... | 0.6616 | 0.6668
If you own a beautiful hand-knotted Oriental rug, you want to keep it looking beautiful forever. The... | 0.9932 | 1.0737
Original publication and holotype designation: Decaisne, J. (1842). Essais sur une classification de... | 1.6182 | 1.6823
Chemical additives are used in foodstuffs and sometimes it results into adulteration due to processi... | 2.418 | 2.4192
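
To make the MSE metric concrete, the sketch below recomputes it over the sample rows above. The (predicted, actual) pairs are copied from those rows; mean_squared_error is a hypothetical helper name. Because this is a 20-row sample rather than the full validation split, the result should not be expected to match the reported validation MSE of 0.1176.

```python
from typing import List, Tuple

# (predicted, actual) pairs copied from the sample rows above
pairs: List[Tuple[float, float]] = [
    (-1.6846, -1.8808), (-1.6602, -1.7688), (-1.373, -1.403), (-1.25, -1.0043),
    (-0.2137, -0.6712), (-0.5405, -0.6453), (-0.0917, -0.3884), (-0.3315, -0.3483),
    (-0.4929, -0.2979), (0.1059, -0.1651), (-0.292, -0.1011), (-0.2188, -0.0925),
    (-0.1429, 0.0449), (0.3911, 0.3204), (0.2075, 0.3725), (0.6953, 0.5373),
    (0.6616, 0.6668), (0.9932, 1.0737), (1.6182, 1.6823), (2.418, 2.4192),
]


def mean_squared_error(rows: List[Tuple[float, float]]) -> float:
    """Average of squared differences between predicted and actual values."""
    return sum((pred - actual) ** 2 for pred, actual in rows) / len(rows)


sample_mse = mean_squared_error(pairs)
print(f"MSE over {len(pairs)} sample rows: {sample_mse:.4f}")
```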

Training Procedure

Training Hyperparameters

  • Learning Rate: 5e-05
  • Train Batch Size: 8
  • Eval Batch Size: 8
  • Seed: 42
  • Optimizer: AdamW (adamw_torch_fused) with $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$
  • LR Scheduler Type: Linear
  • Number of Epochs: 3
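
As a rough illustration of what a linear schedule implies for these settings, the sketch below computes the learning rate at a given step. It assumes zero warmup steps (the card does not state a warmup) and takes the total of 30,000 optimizer steps from the training progress table; linear_lr is a hypothetical helper, not the trainer's own code.

```python
def linear_lr(
    step: int,
    base_lr: float = 5e-5,      # learning rate listed above
    total_steps: int = 30_000,  # final step in the training progress table
    warmup_steps: int = 0,      # assumption: no warmup is stated in the card
) -> float:
    """Linear warmup to base_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 10_000, 20_000, 30_000):
    print(f"step {step:>6}: lr = {linear_lr(step):.2e}")
```

Under these assumptions the learning rate falls steadily across the three epochs and reaches zero at the final step.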

Epoch-by-Epoch Progress

Training Loss | Epoch | Step | Validation Loss | MSE
0.1426 | 1.0 | 10000 | 0.1778 | 0.1778
0.0895 | 2.0 | 20000 | 0.1176 | 0.1176
0.0521 | 3.0 | 30000 | 0.1261 | 0.1261

Note: The model begins to overfit slightly by Epoch 3: training loss keeps falling (0.0895 to 0.0521) while validation loss rises (0.1176 to 0.1261). The Epoch 2 checkpoint has the lowest validation loss / MSE.
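
Picking the checkpoint described in the note amounts to taking the row with the lowest validation loss. A minimal sketch using the figures from the progress table (the dictionary keys are illustrative names, not fields from a trainer log):

```python
# Rows copied from the epoch-by-epoch progress table
progress = [
    {"epoch": 1.0, "step": 10000, "train_loss": 0.1426, "val_loss": 0.1778},
    {"epoch": 2.0, "step": 20000, "train_loss": 0.0895, "val_loss": 0.1176},
    {"epoch": 3.0, "step": 30000, "train_loss": 0.0521, "val_loss": 0.1261},
]

# The best checkpoint is the one that minimizes validation loss,
# not the one with the lowest training loss.
best = min(progress, key=lambda row: row["val_loss"])
print(f"Best checkpoint: epoch {best['epoch']:.0f}, step {best['step']}, "
      f"validation loss {best['val_loss']:.4f}")
```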

Framework Versions

  • Transformers 5.0.0.dev0
  • PyTorch 2.9.1+cu128
  • Datasets 4.4.1
  • Tokenizers 0.22.1

Model size: 33.4M params (Safetensors, tensor type F32)