	---
	library_name: transformers
	license: mit
	base_model: agentlans/multilingual-e5-small-aligned-v2
	tags:
	- generated_from_trainer
	model-index:
	- name: multilingual-e5-small-aligned-v2-text-quality-v3
	  results: []
	language:
	- multilingual
	datasets:
	- agentlans/en-translations-quality-v3
	---

	# Multilingual Text Quality Model

	This model rates how suitable non-English text is as training data for AI models.
	Input a text string, and it outputs a numeric quality score reflecting overall informativeness and usefulness.

	## Performance

	On the evaluation set, it achieved the following (the loss values match the epoch 7 row of the training results):
	- Loss: 0.0641
	- MSE: 0.0641
	- Combined Score: 0.0641
	- Tokens processed during training: 1,109,813,760
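
The loss and MSE figures are identical, which suggests the training objective is mean squared error. Under that reading, the square root of the MSE gives a typical prediction error in the same units as the quality score. A minimal sketch of the arithmetic, using only the number reported above (the variable names are illustrative):

```python
import math

# MSE reported on the evaluation set in this card.
eval_mse = 0.0641

# RMSE is the square root of MSE and is expressed in score units.
eval_rmse = math.sqrt(eval_mse)

print(f"RMSE: {eval_rmse:.3f}")
```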

	## Usage Example

	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification
	import torch

	model_name = "agentlans/multilingual-e5-small-quality-v3"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForSequenceClassification.from_pretrained(model_name).to("cuda" if torch.cuda.is_available() else "cpu")

	# Higher scores indicate higher text quality.
	# The sign of the score has no particular meaning.
	# For example, a negative score doesn't necessarily mean that the text is low quality.
	def quality(text):
	    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(model.device)
	    with torch.no_grad():
	        score = model(**inputs).logits.squeeze().cpu().item()
	    return score

	print(quality("Your text here."))
	```
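
Since only the relative order of scores is meaningful, one natural use is to rank a set of texts against each other rather than apply a fixed threshold. The sketch below shows that pattern in plain Python. `rank_by_quality` and `dummy_score` are hypothetical helpers introduced for illustration; in practice the `quality` function from the example above would be passed as `score_fn`.

```python
from typing import Callable, List, Tuple


def rank_by_quality(
    texts: List[str], score_fn: Callable[[str], float]
) -> List[Tuple[float, str]]:
    """Return (score, text) pairs sorted from highest to lowest score."""
    scored = [(score_fn(text), text) for text in texts]
    # Sort on the score only, so tied texts keep their original order.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def dummy_score(text: str) -> float:
    """Stand-in scorer for demonstration only; it is NOT the model."""
    return len(text.split()) - 3.0


# Negative scores are ranked like any other value: only the order matters.
for score, text in rank_by_quality(["a b", "a b c d e", "a"], dummy_score):
    print(f"{score:+.1f}  {text}")
```

Keeping the scorer as a parameter means the ranking logic can be tested without downloading the model.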

	## Limitations

	- Works best on non-fiction and general-purpose texts.
	- Scores give an overall quality estimate but don’t explain why.
	- Unlike the other `quality-v3` models, this model was trained only on short non-English sentences.
	- Check for biases and suitability before use.

	## Training procedure

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 5e-05
	- train_batch_size: 8
	- eval_batch_size: 8
	- seed: 42
	- optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
	- lr_scheduler_type: linear
	- num_epochs: 10.0
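
With `lr_scheduler_type: linear`, the learning rate decays in a straight line from its peak towards zero over training. The sketch below reproduces that shape in plain Python. It assumes zero warmup steps (the card does not list any) and takes the step counts from the training results table; `linear_lr` is an illustrative helper, not part of the Transformers API.

```python
PEAK_LR = 5e-05            # learning_rate from the hyperparameter list
STEPS_PER_EPOCH = 108_381  # step count after epoch 1 in the results table
TOTAL_STEPS = 1_083_810    # step count after epoch 10 in the results table


def linear_lr(step: int, peak_lr: float = PEAK_LR, total_steps: int = TOTAL_STEPS) -> float:
    """Learning rate after `step` optimizer steps under linear decay, assuming zero warmup."""
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / total_steps


for epoch in (0, 5, 10):
    step = epoch * STEPS_PER_EPOCH
    print(f"epoch {epoch:>2}: lr = {linear_lr(step):.2e}")
```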

	### Training results

	| Training Loss | Epoch | Step    | Validation Loss | MSE    | Combined Score | Input Tokens Seen |
	|:-------------:|:-----:|:-------:|:---------------:|:------:|:--------------:|:-----------------:|
	| 0.0725        | 1.0   | 108381  | 0.0727          | 0.0727 | 0.0727         | 110981376         |
	| 0.0603        | 2.0   | 216762  | 0.0675          | 0.0675 | 0.0675         | 221962752         |
	| 0.0559        | 3.0   | 325143  | 0.0703          | 0.0703 | 0.0703         | 332944128         |
	| 0.0387        | 4.0   | 433524  | 0.0675          | 0.0675 | 0.0675         | 443925504         |
	| 0.0325        | 5.0   | 541905  | 0.0704          | 0.0704 | 0.0704         | 554906880         |
	| 0.0276        | 6.0   | 650286  | 0.0672          | 0.0672 | 0.0672         | 665888256         |
	| 0.025         | 7.0   | 758667  | 0.0641          | 0.0641 | 0.0641         | 776869632         |
	| 0.0182        | 8.0   | 867048  | 0.0676          | 0.0676 | 0.0676         | 887851008         |
	| 0.0154        | 9.0   | 975429  | 0.0647          | 0.0647 | 0.0647         | 998832384         |
	| 0.0133        | 10.0  | 1083810 | 0.0643          | 0.0643 | 0.0643         | 1109813760        |
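
The table can be checked with a little arithmetic. The sketch below copies the validation losses and the first-epoch counters from the table, finds the epoch with the lowest validation loss, and derives the average number of input tokens per optimizer step. Reading the result as roughly 128 tokens per example is an inference: it assumes a single device and no gradient accumulation, neither of which the card states.

```python
# Validation loss per epoch, copied from the training results table.
val_loss = {1: 0.0727, 2: 0.0675, 3: 0.0703, 4: 0.0675, 5: 0.0704,
            6: 0.0672, 7: 0.0641, 8: 0.0676, 9: 0.0647, 10: 0.0643}

best_epoch = min(val_loss, key=val_loss.get)
print(f"Lowest validation loss: {val_loss[best_epoch]} at epoch {best_epoch}")

# Counters after the first epoch, also copied from the table.
steps_per_epoch = 108_381
tokens_per_epoch = 110_981_376
train_batch_size = 8  # from the hyperparameter list

tokens_per_step = tokens_per_epoch / steps_per_epoch
# Assumption: one device and no gradient accumulation.
tokens_per_example = tokens_per_step / train_batch_size
print(f"Tokens per step: {tokens_per_step:.1f}")
print(f"Tokens per example: {tokens_per_example:.1f}")
```

The training loss keeps falling after epoch 7 while the validation loss stays flat, which is why the lowest validation loss does not coincide with the final epoch.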


	### Framework versions

	- Transformers 4.51.3
	- PyTorch 2.6.0+cu124
	- Datasets 3.2.0
	- Tokenizers 0.21.0