bge-small-en-text-quality
This model evaluates English language text quality by mapping text inputs to a continuous quality score.
It is a fine-tuned version of BAAI/bge-small-en-v1.5 optimized on the agentlans/text-quality-v3 dataset for text quality regression.
Evaluation Performance
- Validation Loss: 0.1176
- Mean Squared Error (MSE): 0.1176 (achieved at Epoch 2)
Model Description
This model acts as a scoring mechanism for the informational value and professional quality of English text. It outputs a single scalar value: higher scores generally indicate structured, formal, and high-information content (e.g., academic, technical, or well-structured reference text), while lower scores indicate spam, unverified chatter, or low-context snippets.
How to Use
You can use this model directly with the Hugging Face transformers library for sequence classification/regression:
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load model and tokenizer
model_name = "agentlans/bge-small-en-text-quality"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Sample texts
texts = [
    "Original publication and designation: Decaisne, J. (1842). Essais sur une classification des algues...",
    "Modern Action Figure Toys & Collectibles!! You've been outbid to E****y! to YOU!"
]

# Tokenize and predict
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The model outputs a single continuous value per text
scores = outputs.logits.squeeze(-1).tolist()

for text, score in zip(texts, scores):
    print(f"Score: {score:.4f} | Text: {text[:80]}...")
```
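For corpora too large to tokenize in one call, the same prediction step can be run in fixed-size batches. The following is a minimal sketch, not part of the original card: `score_corpus` and `score_batch` are hypothetical names, `score_batch` stands for any callable that wraps the tokenizer and model call above and returns one float per input text, and the default batch size of 32 is an arbitrary assumption.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def score_corpus(
    texts: Sequence[str],
    score_batch: Callable[[List[str]], List[float]],  # hypothetical wrapper around the model
    batch_size: int = 32,  # assumed value, tune to available memory
) -> List[float]:
    """Score every text, preserving input order, one batch at a time."""
    scores: List[float] = []
    for batch in batched(texts, batch_size):
        batch_scores = score_batch(batch)
        if len(batch_scores) != len(batch):
            raise ValueError("score_batch must return one score per text")
        scores.extend(batch_scores)
    return scores
```

In practice `score_batch` would tokenize the batch, run the model under `torch.no_grad()`, and return `outputs.logits.squeeze(-1).tolist()`.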
Intended Uses & Limitations
Primary Use Cases
- Content Moderation: A rapid, lightweight method to filter out spam, SEO filler, and corrupted web-scraped text.
- Data Filtering: Sorting or ranking documents by informational value before using them for LLM pre-training or RAG (Retrieval-Augmented Generation) pipelines.
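The filtering and ranking use cases above can be sketched as a small post-processing step over scores that have already been computed. This is an illustrative sketch only: `rank_and_filter` is a hypothetical helper, and the cutoff of `0.0` is an assumed value chosen for the example, not a threshold recommended by the model card.

```python
from typing import List, Sequence, Tuple


def rank_and_filter(
    texts: Sequence[str],
    scores: Sequence[float],
    min_score: float = 0.0,  # assumed cutoff; calibrate on your own data
) -> List[Tuple[float, str]]:
    """Drop texts scoring below `min_score` and return the rest, best first."""
    if len(texts) != len(scores):
        raise ValueError("texts and scores must have the same length")
    kept = [(score, text) for text, score in zip(texts, scores) if score >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept


# Example using three predicted values from the evaluation table in this card.
docs = [
    "Modern Action Figure Toys & Collectibles - Vintage, Rare and Hard to Find Toys",
    "Original publication and holotype designation: Decaisne, J. (1842).",
    "If you own a beautiful hand-knotted Oriental rug, you want to keep it looking beautiful forever.",
]
predicted = [-1.6846, 1.6182, 0.9932]

for score, text in rank_and_filter(docs, predicted):
    print(f"{score:+.4f}  {text[:60]}")
```

Because of the genre bias described under Known Limitations, a single global cutoff may discard legitimate creative or opinionated text, so the threshold is worth checking against a hand-labelled sample.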
Known Limitations
- Genre Bias: The model is heavily weighted toward structured, informative prose. Subjective, opinionated, or creative writing (e.g., personal blogs, fiction, entertainment reviews) will inherently score lower regardless of their actual artistic or human value.
- Lack of Explainability: It returns a raw numerical score without indicating why a text was deemed high or low quality.
- No Fact-Checking Capability: The model scores structural and stylistic indicators of quality. It cannot verify the objective accuracy or truthfulness of a statement; highly structured misinformation may still receive a high rating.
Training and Evaluation Data
The model was trained and evaluated on the agentlans/text-quality-v3 dataset. Below is a sample of text inputs alongside their predicted scores versus their true dataset targets.
Sample Evaluation Results
| Input | Predicted Value | Actual |
|---|---|---|
| Modern Action Figure Toys & Collectibles - Vintage, Rare and Hard to Find Toys including Convention ... | -1.6846 | -1.8808 |
| Find 2 listings related to afterglow hair salon in Defuniak Springs on YP.com. See reviews, photos, ... | -1.6602 | -1.7688 |
| We had a little bit of down time yesterday due to some technical difficulties. In other words Martin... | -1.373 | -1.403 |
| Remember Ellen Feiss? From Apple’s ‘Switch’ campaign? THAT is awesome! Thank you so much for sharing... | -1.25 | -1.0043 |
| Should we talk with U.S. Congressmen who we know support campaign finance reform? Our Theory of Chan... | -0.2137 | -0.6712 |
| If you are looking for one of the best high end luxury apartment rentals in all of London I highly r... | -0.5405 | -0.6453 |
| To be fair, Lebron is trying to get his teammates involved. He kinda has to play her ball here, even... | -0.0917 | -0.3884 |
| Sarah Hofstetter is the global CEO at 360i, the hotshot agency that’s behind some of the most buzzed... | -0.3315 | -0.3483 |
| Our Lynchburg office recently completed a retrofit roof at Carter Machinery. They installed 1 ½” ISO... | -0.4929 | -0.2979 |
| A statement suspension light, the Innermost Beads Octo Pendant string together multiple spheres of g... | 0.1059 | -0.1651 |
| Today I wanted to share a clip art set that isn't new but I thought you all might be working on some... | -0.292 | -0.1011 |
| Chinnamasta shows us the simple, playful, and fierce truth that much of what we need is already righ... | -0.2188 | -0.0925 |
| Ministers order BDS activist Lydia de Leeuw to be denied entry into Israel. Public Security Minister... | -0.1429 | 0.0449 |
| LJW lawyers have extensive experience litigating commercial disputes on a broad range of issues and ... | 0.3911 | 0.3204 |
| Dews is a professional waterman and has made a name for himself in stand up paddle boarding, surfing... | 0.2075 | 0.3725 |
| The festival of Lohri is Punjabis’ cultural celebration which marks the culmination of winter by lig... | 0.6953 | 0.5373 |
| At PEC we understand government process for Afghanistan Embassy MBBS Certificate Attestation which c... | 0.6616 | 0.6668 |
| If you own a beautiful hand-knotted Oriental rug, you want to keep it looking beautiful forever. The... | 0.9932 | 1.0737 |
| Original publication and holotype designation: Decaisne, J. (1842). Essais sur une classification de... | 1.6182 | 1.6823 |
| Chemical additives are used in foodstuffs and sometimes it results into adulteration due to processi... | 2.418 | 2.4192 |
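To make the comparison between the two numeric columns concrete, the sketch below recomputes mean squared error and mean absolute error over the 20 rows shown in the table. The `(predicted, actual)` pairs are copied from the table; the helper names are illustrative and not part of the original card.

```python
from typing import List, Tuple

# (predicted, actual) pairs copied row by row from the table above.
rows: List[Tuple[float, float]] = [
    (-1.6846, -1.8808), (-1.6602, -1.7688), (-1.373, -1.403), (-1.25, -1.0043),
    (-0.2137, -0.6712), (-0.5405, -0.6453), (-0.0917, -0.3884), (-0.3315, -0.3483),
    (-0.4929, -0.2979), (0.1059, -0.1651), (-0.292, -0.1011), (-0.2188, -0.0925),
    (-0.1429, 0.0449), (0.3911, 0.3204), (0.2075, 0.3725), (0.6953, 0.5373),
    (0.6616, 0.6668), (0.9932, 1.0737), (1.6182, 1.6823), (2.418, 2.4192),
]


def mean_squared_error(pairs: List[Tuple[float, float]]) -> float:
    """Average of squared differences between prediction and target."""
    return sum((pred - actual) ** 2 for pred, actual in pairs) / len(pairs)


def mean_absolute_error(pairs: List[Tuple[float, float]]) -> float:
    """Average of absolute differences between prediction and target."""
    return sum(abs(pred - actual) for pred, actual in pairs) / len(pairs)


sample_mse = mean_squared_error(rows)
sample_mae = mean_absolute_error(rows)
print(f"Sample MSE: {sample_mse:.4f} | Sample MAE: {sample_mae:.4f}")
```

A 20-row sample is far too small to reproduce the reported validation MSE, so these figures only describe the rows displayed here.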
Training Procedure
Training Hyperparameters
- Learning Rate: 5e-05
- Train Batch Size: 8
- Eval Batch Size: 8
- Seed: 42
- Optimizer: AdamW (adamw_torch_fused) with $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$
- LR Scheduler Type: Linear
- Num Epochs: 3.0
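The hyperparameters listed above can be collected into keyword arguments named after fields of the Hugging Face `TrainingArguments` class. This is a sketch under assumptions rather than the author's actual training script: the card does not say whether the batch sizes are per device or total, so the `per_device_*` mapping is a guess, and `output_dir` in the comment is a hypothetical path.

```python
# Hyperparameters from this card, keyed by TrainingArguments field names.
hyperparams = {
    "learning_rate": 5e-5,
    "per_device_train_batch_size": 8,  # assumption: card's "Train Batch Size" is per device
    "per_device_eval_batch_size": 8,   # assumption: card's "Eval Batch Size" is per device
    "seed": 42,
    "optim": "adamw_torch_fused",
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
    "num_train_epochs": 3.0,
}

# With transformers installed, these could be passed along the lines of:
#   from transformers import TrainingArguments
#   args = TrainingArguments(output_dir="out", **hyperparams)  # "out" is a placeholder
for name, value in hyperparams.items():
    print(f"{name} = {value}")
```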
Epoch-by-Epoch Progress
| Training Loss | Epoch | Step | Validation Loss | MSE |
|---|---|---|---|---|
| 0.1426 | 1.0 | 10000 | 0.1778 | 0.1778 |
| 0.0895 | 2.0 | 20000 | 0.1176 | 0.1176 |
| 0.0521 | 3.0 | 30000 | 0.1261 | 0.1261 |
Note: The model begins over-fitting slightly by Epoch 3; the checkpoint at Epoch 2 provides the lowest Validation Loss / MSE.
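The checkpoint choice described in the note amounts to taking the row with the lowest validation loss. A minimal sketch using the three rows of the table (the `EpochLog` type and `best_checkpoint` function are illustrative names, not from the original training code):

```python
from typing import List, NamedTuple


class EpochLog(NamedTuple):
    training_loss: float
    epoch: float
    step: int
    validation_loss: float


# Rows copied from the epoch-by-epoch table above.
history: List[EpochLog] = [
    EpochLog(0.1426, 1.0, 10000, 0.1778),
    EpochLog(0.0895, 2.0, 20000, 0.1176),
    EpochLog(0.0521, 3.0, 30000, 0.1261),
]


def best_checkpoint(logs: List[EpochLog]) -> EpochLog:
    """Return the logged epoch with the lowest validation loss."""
    return min(logs, key=lambda log: log.validation_loss)


best = best_checkpoint(history)
print(f"Best epoch: {best.epoch} (step {best.step}, val loss {best.validation_loss})")
```

Training loss keeps falling through Epoch 3 while validation loss rises, which is the overfitting pattern the note refers to.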
Framework Versions
- Transformers 5.0.0.dev0
- PyTorch 2.9.1+cu128
- Datasets 4.4.1
- Tokenizers 0.22.1